Deployments (Wikitech, last edited 2025-06-08T22:20:57Z by ScheduleDeploymentBot: Add [[gerrit:1130201]] to Monday, June 09 UTC morning backport window)

{{Navigation MediaWiki deployment}}
This page tracks '''upcoming''' '''deployments''' of software to the [[m:Special:SiteMatrix|Wikimedia Foundation servers]].

== Getting started ==
Ensure you have joined the {{irc|wikimedia-operations}} IRC channel, as all deployment-related communication happens there. If you need help, contact [[mw:Wikimedia Release Engineering Team|Release Engineering]] on IRC at {{irc|wikimedia-releng}} and ping Tyler (<code>thcipriani</code>).

* '''MediaWiki is deployed weekly''' through the [[/Train|Deployment Train]]. Other services follow their own schedule.
* '''Times are pinned to San Francisco''', so the UTC times shift in March and November per [[:en:Daylight saving time in the United States|DST]].
* '''Prefer regular [[Backport windows]]''' over adding new windows. To request deployment of a config change or backport, add your username and Gerrit URL to one of the backport windows on this page. You must be online in #wikimedia-operations on IRC during your deployment and install [[WikimediaDebug]] ahead of time. The #wikimedia-operations channel requires you to [[m:IRC/Instructions#Register your nickname, identify, and enforce|register your nickname]] before you can join.
** You can use the '''[https://schedule-deployment.toolforge.org/ backport scheduling tool]''' to more easily edit this page.
* Tasks that meet the [[/Inclusion criteria|inclusion criteria]] '''require their own windows'''; this includes long-running tasks. '''Schedule more time''' than you think you need to account for delays and setbacks; we recommend one hour for most tasks.
** To create or modify a recurring deploy window, send a patchset to the [[gitlab:repos/releng/release/-/blob/main/make-deployment-calendar/deployments-calendar.yaml|deployments-calendar.yaml]] file in <code>repos/releng/release.git</code>.
** To create a one-off window, edit this page accordingly.
** '''Announce''' changes to the [[mail:ops|ops mailing list]] ahead of time if you anticipate or are uncertain about noticeable impacts to database load, HTTP caching, or the introduction of new cookies.
** '''Announce''' deployments of major features to the community via [[meta:Tech/News/Next|Tech News]] and/or via other [[mediawikiwiki:Wikimedia_Product_Guidance/Communication_channels|Product communication channels]].
* '''Something went wrong?''' See [[Incident response]]. Is there a user-impacting problem? Communicate in the {{irc|wikimedia-operations}} IRC channel. If there is a Phabricator task, ensure [[phab:tag/wikimedia-incident/|#Wikimedia-Incident]] is tagged, and consider setting the [[mw:Phabricator/Project_management#Priority_levels|Unbreak Now]] priority.

__TOC__
{{anchor|Next Week|Near Term|Near term|Near-term}}{{clear}}
[[Category:Deployment]]
{{Note|content=Subscribe in Google Calendar via <code>wikimedia.org_rudis09ii2mm5fk4hgdjeh1u64@group.calendar.google.com</code>.<br>This may not include one-off windows. '''If there are differences, then the wiki page is canonical and correct'''.}}

==Week of June 09==
==={{Deployment_day|date=2025-06-08}}===
{{Deployment calendar event card
|when=2025-06-08 00:00 SF
|length=24
|window=No deploys all day! See [[Deployments/Emergencies]] if things are broken.
|who= |what=No Deploys }} ==={{Deployment_day|date=2025-06-09}}=== {{Deployment calendar event card |when=2025-06-09 00:00 SF |length=1 |window=[[Backport windows|UTC morning backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Amir1|Amir}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|awight|Adam}} |what= {{ircnick|sergi0|Sergio Gimeno}} {{deploy|type=config|gerrit=1154282|title=[beta] GrowthExperiments: enable limiting add a link task via config|status=}} - {{phabricator|T393769}} {{phabricator|T395383}} {{phabricator|T393923}} {{ircnick|-|Umherirrender}} {{deploy|type=config|gerrit=1130201|title=Improve function and property documentation for php code|status=}} - {{phabricator|T171115}} {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-09 03:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC mid-day) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-09 06:00 SF |length=1 |window=[[Backport windows|UTC afternoon backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Lucas_WMDE|Lucas}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}} |what= {{ircnick|MatmaRex|Bartosz}} {{deploy|type=config|gerrit=1153363|title=logging: Allow sampling of Logstash logs|status=}} - {{phabricator|T395967}} {{deploy|type=config|gerrit=1153364|title=logging: Sample some high-volume log streams|status=}} - {{phabricator|T394402}} {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-09 08:30 SF |length=0.5 |window=Wikimedia Portals Update |who={{ircnick|jan_drewniak|Jan Drewniak}} |what=Weekly window for the portals page: https://www.wikipedia.org/ }} {{Deployment calendar event card |when=2025-06-09 10:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC late) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. }} {{Deployment calendar event card |when=2025-06-09 10:00 SF |length=0.5 |window=Wikidata Query Service weekly deploy |who={{ircnick|ryankemper|Ryan}} |what=... 
}} {{Deployment calendar event card |when=2025-06-09 13:00 SF |length=1 |window=[[Backport windows|UTC late backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|RoanKattouw|Roan}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}}, {{ircnick|kindrobot|Stef}}, {{ircnick|cjming|Clare}} |what= {{ircnick|sd|sd}} {{deploy|type=config|gerrit=1144484|title=Replace deprecated wgCirrusSearchWMFExtraFeatures with wgCirrusSearchWeightedTags|status=}} - {{phabricator|T393872}} {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-09 14:00 SF |length=2 |window=Weekly Security deployment window |who={{ircnick|Reedy|Sam}}, {{ircnick|sbassett|Scott}}, {{ircnick|Maryum|Maryum}}, {{ircnick|manfredi|Manfredi}} |what=Held deployment window for Security-team related deploys. }} {{Deployment calendar event card |when=2025-06-09 16:00 SF |length=1 |window=Web Team deployment window |who=Web Team |what=NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start }} {{Deployment calendar event card |when=2025-06-09 19:00 SF |length=1 |window=Automatic branching of MediaWiki, extensions, skins, and vendor – see [[Heterogeneous_deployment/Train_deploys]] |who=N/A |what=Branch <code>wmf/1.45.0-wmf.5</code> }} {{Deployment calendar event card |when=2025-06-09 20:00 SF |length=1 |window=Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see [[Heterogeneous_deployment/Train_deploys]] |who=N/A |what=Deploy <code>wmf/1.45.0-wmf.5</code> to testwikis }} {{Deployment calendar event card |when=2025-06-09 21:00 SF |length=1 |window=Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) |who=N/A |what=Runs <code>scap clean auto</code> }}
{{Deployment calendar event card |when=2025-06-09 23:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC early) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. }} {{Deployment calendar event card |when=2025-06-09 23:00 SF |length=0.5 |window=Primary database switchover |who={{ircnick|marostegui|Manuel Arostegui}}, {{ircnick|Amir1|Amir}}, {{ircnick|federico3|Federico Ceratto}} |what=Held deployment window for database primary masters maintenance }} ==={{Deployment_day|date=2025-06-10}}=== {{Deployment calendar event card |when=2025-06-10 00:00 SF |length=1 |window=[[Backport windows|UTC morning backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Amir1|Amir}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|awight|Adam}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-10 03:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC mid-day) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-10 05:00 SF |length=1 |window=Mobileapps/RESTBase/Wikifeeds |who=Content Transform Team |what=Content transform team node services (mobileapps/wikifeeds) }} {{Deployment calendar event card |when=2025-06-10 06:00 SF |length=1 |window=[[Backport windows|UTC afternoon backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Lucas_WMDE|Lucas}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-10 08:00 SF |length=1 |window=SRE Collaboration Services office hours |who={{ircnick|jelto|Jelto}}, {{ircnick|arnoldokoth|Arnold}}, {{ircnick|mutante|Daniel}} |what=Services including Gerrit, Phorge (Phabricator), GitLab }} {{Deployment calendar event card |when=2025-06-10 09:00 SF |length=1 |window=[[Puppet request window]]<br/><small>'''(Max 6 patches)'''</small> |who={{ircnick|jhathaway|JHathaway}}, {{ircnick|moritzm|Moritz}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to Puppet change'' }} {{Deployment calendar event card |when=2025-06-10 10:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC late) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-10 11:00 SF |length=2 |window=MediaWiki train - Utc-7 Version |who={{ircnick|thcipriani|Tyler}}, {{ircnick|thcipriani|Tyler}} |what=[[mw:MediaWiki 1.45/Roadmap#Schedule for the deployments|1.45 schedule]] {{DeployOneWeekMini|1.45.0-wmf.4->1.45.0-wmf.5|1.45.0-wmf.4|1.45.0-wmf.4}} * group0 to [[mw:MediaWiki_1.45/wmf.5|1.45.0-wmf.5]] * '''Blockers: {{phabricator|T392175}}''' }} {{Deployment calendar event card |when=2025-06-10 13:00 SF |length=1 |window=[[Backport windows|UTC late backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|RoanKattouw|Roan}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}}, {{ircnick|kindrobot|Stef}}, {{ircnick|cjming|Clare}} |what= {{ircnick|bwang|bwang}} {{deploy|type=config|gerrit=1154057|title=Enable empty search recommendations for Vector on all wikipedias, and for Minerva on group1 wikis and wikivoyage|status=}} - {{phabricator|T395344}} {{phabricator|T395339}} {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-10 14:00 SF |length=1 |window=Web Team deployment window |who=Web Team |what=NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start }} {{Deployment calendar event card |when=2025-06-10 23:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC early) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} ==={{Deployment_day|date=2025-06-11}}=== {{Deployment calendar event card |when=2025-06-11 00:00 SF |length=1 |window=[[Backport windows|UTC morning backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Amir1|Amir}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|awight|Adam}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-11 03:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC mid-day) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. }} {{Deployment calendar event card |when=2025-06-11 04:00 SF |length=1 |window=[[mw:Services|Services]] – [[Citoid]] / [[Zotero]] |who=Marielle ({{ircnick|mvolz}}) |what=See [[mw:Citoid|Citoid]] }} {{Deployment calendar event card |when=2025-06-11 06:00 SF |length=1 |window=[[Backport windows|UTC afternoon backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Lucas_WMDE|Lucas}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-11 07:00 SF |length=1 |window=Wikifunctions Services UTC Afternoon |who=Abstract Wikipedia team (Africa, Europe, Eastern Americas) |what=Wikifunctions back-end k8s services }} {{Deployment calendar event card |when=2025-06-11 10:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC late) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-11 11:00 SF |length=2 |window=MediaWiki train - Utc-7 Version |who={{ircnick|thcipriani|Tyler}}, {{ircnick|thcipriani|Tyler}} |what=[[mw:MediaWiki 1.45/Roadmap#Schedule for the deployments|1.45 schedule]] {{DeployOneWeekMini|1.45.0-wmf.5|1.45.0-wmf.4->1.45.0-wmf.5|1.45.0-wmf.4}} * group1 to [[mw:MediaWiki_1.45/wmf.5|1.45.0-wmf.5]] * '''Blockers: {{phabricator|T392175}}''' }} {{Deployment calendar event card |when=2025-06-11 13:00 SF |length=1 |window=[[Backport windows|UTC late backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|RoanKattouw|Roan}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}}, {{ircnick|kindrobot|Stef}}, {{ircnick|cjming|Clare}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-11 14:00 SF |length=1 |window=Wikifunctions Services UTC Late |who=Abstract Wikipedia team (North and South America) |what=Wikifunctions back-end k8s services }} {{Deployment calendar event card |when=2025-06-11 15:00 SF |length=1 |window=Web Team deployment window |who=Web Team |what=NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start }} {{Deployment calendar event card |when=2025-06-11 23:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC early) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-11 23:00 SF |length=0.5 |window=Primary database switchover |who={{ircnick|marostegui|Manuel Arostegui}}, {{ircnick|Amir1|Amir}}, {{ircnick|federico3|Federico Ceratto}} |what=Held deployment window for database primary masters maintenance }} ==={{Deployment_day|date=2025-06-12}}=== {{Deployment calendar event card |when=2025-06-12 00:00 SF |length=1 |window=[[Backport windows|UTC morning backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Amir1|Amir}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|awight|Adam}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-12 03:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC mid-day) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-12 05:00 SF |length=1 |window=Mobileapps/RESTBase/Wikifeeds |who=Content Transform Team |what=Content transform team node services (mobileapps/wikifeeds) }} {{Deployment calendar event card |when=2025-06-12 06:00 SF |length=1 |window=[[Backport windows|UTC afternoon backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Lucas_WMDE|Lucas}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-12 08:00 SF |length=1 |window=Train log triage |who={{ircnick|thcipriani|Tyler}}, {{ircnick|thcipriani|Tyler}} |what=See [[Heterogeneous_deployment/Train_deploys#Breakage]] }} {{Deployment calendar event card |when=2025-06-12 09:00 SF |length=1 |window=[[Puppet request window]]<br/><small>'''(Max 6 patches)'''</small> |who={{ircnick|jhathaway|JHathaway}}, {{ircnick|moritzm|Moritz}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to Puppet change'' }} {{Deployment calendar event card |when=2025-06-12 10:00 SF |length=1 |window=Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) |who={{ircnick|bd808}} |what=... }} {{Deployment calendar event card |when=2025-06-12 10:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC late) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-12 11:00 SF |length=2 |window=MediaWiki train - Utc-7 Version |who={{ircnick|thcipriani|Tyler}}, {{ircnick|thcipriani|Tyler}} |what=[[mw:MediaWiki 1.45/Roadmap#Schedule for the deployments|1.45 schedule]] {{DeployOneWeekMini|1.45.0-wmf.5|1.45.0-wmf.5|1.45.0-wmf.4->1.45.0-wmf.5}} * group2 to [[mw:MediaWiki_1.45/wmf.5|1.45.0-wmf.5]] * '''Blockers: {{phabricator|T392175}}''' }} {{Deployment calendar event card |when=2025-06-12 13:00 SF |length=1 |window=[[Backport windows|UTC late backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|RoanKattouw|Roan}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}}, {{ircnick|kindrobot|Stef}}, {{ircnick|cjming|Clare}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-12 14:00 SF |length=1 |window=Web Team deployment window |who=Web Team |what=NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start }} {{Deployment calendar event card |when=2025-06-12 23:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC early) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. }} ==={{Deployment_day|date=2025-06-13}}=== {{Deployment calendar event card |when=2025-06-13 00:00 SF |length=24 |window=No deploys all day! See [[Deployments/Emergencies]] if things are broken. 
|who= |what=No Deploys }} {{Deployment calendar event card |when=2025-06-13 04:00 SF |length=0.5 |window=GitLab version upgrades |who={{ircnick|jelto|Jelto}}, {{ircnick|arnoldokoth|Arnold}}, {{ircnick|mutante|Daniel}} |what=GitLab version upgrades }} ==={{Deployment_day|date=2025-06-14}}=== {{Deployment calendar event card |when=2025-06-14 00:00 SF |length=24 |window=No deploys all day! See [[Deployments/Emergencies]] if things are broken. |who= |what=No Deploys }} ==Week of June 16== ==={{Deployment_day|date=2025-06-15}}=== {{Deployment calendar event card |when=2025-06-15 00:00 SF |length=24 |window=No deploys all day! See [[Deployments/Emergencies]] if things are broken. |who= |what=No Deploys }} ==={{Deployment_day|date=2025-06-16}}=== {{Deployment calendar event card |when=2025-06-16 00:00 SF |length=1 |window=[[Backport windows|UTC morning backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Amir1|Amir}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|awight|Adam}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-16 03:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC mid-day) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-16 06:00 SF |length=1 |window=[[Backport windows|UTC afternoon backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Lucas_WMDE|Lucas}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-16 08:30 SF |length=0.5 |window=Wikimedia Portals Update |who={{ircnick|jan_drewniak|Jan Drewniak}} |what=Weekly window for the portals page: https://www.wikipedia.org/ }} {{Deployment calendar event card |when=2025-06-16 10:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC late) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. }} {{Deployment calendar event card |when=2025-06-16 10:00 SF |length=0.5 |window=Wikidata Query Service weekly deploy |who={{ircnick|ryankemper|Ryan}} |what=... }} {{Deployment calendar event card |when=2025-06-16 13:00 SF |length=1 |window=[[Backport windows|UTC late backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|RoanKattouw|Roan}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}}, {{ircnick|kindrobot|Stef}}, {{ircnick|cjming|Clare}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-16 14:00 SF |length=2 |window=Weekly Security deployment window |who={{ircnick|Reedy|Sam}}, {{ircnick|sbassett|Scott}}, {{ircnick|Maryum|Maryum}}, {{ircnick|manfredi|Manfredi}} |what=Held deployment window for Security-team related deploys. 
}} {{Deployment calendar event card |when=2025-06-16 16:00 SF |length=1 |window=Web Team deployment window |who=Web Team |what=NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start }} {{Deployment calendar event card |when=2025-06-16 19:00 SF |length=1 |window=Automatic branching of MediaWiki, extensions, skins, and vendor – see [[Heterogeneous_deployment/Train_deploys]] |who=N/A |what=Branch <code>wmf/1.45.0-wmf.6</code> }} {{Deployment calendar event card |when=2025-06-16 20:00 SF |length=1 |window=Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see [[Heterogeneous_deployment/Train_deploys]] |who=N/A |what=Deploy <code>wmf/1.45.0-wmf.6</code> to testwikis }} {{Deployment calendar event card |when=2025-06-16 21:00 SF |length=1 |window=Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) |who=N/A |what=Runs <code>scap clean auto</code> }} {{Deployment calendar event card |when=2025-06-16 23:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC early) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment.
}} {{Deployment calendar event card |when=2025-06-16 23:00 SF |length=0.5 |window=Primary database switchover |who={{ircnick|marostegui|Manuel Arostegui}}, {{ircnick|Amir1|Amir}}, {{ircnick|federico3|Federico Ceratto}} |what=Held deployment window for database primary masters maintenance }} ==={{Deployment_day|date=2025-06-17}}=== {{Deployment calendar event card |when=2025-06-17 00:00 SF |length=1 |window=[[Backport windows|UTC morning backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Amir1|Amir}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|awight|Adam}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-17 03:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC mid-day) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-17 05:00 SF |length=1 |window=Mobileapps/RESTBase/Wikifeeds |who=Content Transform Team |what=Content transform team node services (mobileapps/wikifeeds) }} {{Deployment calendar event card |when=2025-06-17 06:00 SF |length=1 |window=[[Backport windows|UTC afternoon backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Lucas_WMDE|Lucas}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-17 08:00 SF |length=1 |window=SRE Collaboration Services office hours |who={{ircnick|jelto|Jelto}}, {{ircnick|arnoldokoth|Arnold}}, {{ircnick|mutante|Daniel}} |what=Services including Gerrit, Phorge (Phabricator), GitLab }} {{Deployment calendar event card |when=2025-06-17 09:00 SF |length=1 |window=[[Puppet request window]]<br/><small>'''(Max 6 patches)'''</small> |who={{ircnick|jhathaway|JHathaway}}, {{ircnick|moritzm|Moritz}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to Puppet change'' }} {{Deployment calendar event card |when=2025-06-17 10:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC late) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-17 11:00 SF |length=2 |window=MediaWiki train - Utc-7 Version |who={{ircnick|thcipriani|Tyler}}, {{ircnick|thcipriani|Tyler}} |what=[[mw:MediaWiki 1.45/Roadmap#Schedule for the deployments|1.45 schedule]] {{DeployOneWeekMini|1.45.0-wmf.5->1.45.0-wmf.6|1.45.0-wmf.5|1.45.0-wmf.5}} * group0 to [[mw:MediaWiki_1.45/wmf.6|1.45.0-wmf.6]] * '''Blockers: {{phabricator|T392176}}''' }} {{Deployment calendar event card |when=2025-06-17 13:00 SF |length=1 |window=[[Backport windows|UTC late backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|RoanKattouw|Roan}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}}, {{ircnick|kindrobot|Stef}}, {{ircnick|cjming|Clare}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-17 14:00 SF |length=1 |window=Web Team deployment window |who=Web Team |what=NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start }} {{Deployment calendar event card |when=2025-06-17 23:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC early) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} ==={{Deployment_day|date=2025-06-18}}=== {{Deployment calendar event card |when=2025-06-18 00:00 SF |length=1 |window=[[Backport windows|UTC morning backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Amir1|Amir}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|awight|Adam}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-18 03:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC mid-day) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. }} {{Deployment calendar event card |when=2025-06-18 04:00 SF |length=1 |window=[[mw:Services|Services]] – [[Citoid]] / [[Zotero]] |who=Marielle ({{ircnick|mvolz}}) |what=See [[mw:Citoid|Citoid]] }} {{Deployment calendar event card |when=2025-06-18 06:00 SF |length=1 |window=[[Backport windows|UTC afternoon backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Lucas_WMDE|Lucas}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-18 07:00 SF |length=1 |window=Wikifunctions Services UTC Afternoon |who=Abstract Wikipedia team (Africa, Europe, Eastern Americas) |what=Wikifunctions back-end k8s services }} {{Deployment calendar event card |when=2025-06-18 10:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC late) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-18 11:00 SF |length=2 |window=MediaWiki train - Utc-7 Version |who={{ircnick|thcipriani|Tyler}}, {{ircnick|thcipriani|Tyler}} |what=[[mw:MediaWiki 1.45/Roadmap#Schedule for the deployments|1.45 schedule]] {{DeployOneWeekMini|1.45.0-wmf.6|1.45.0-wmf.5->1.45.0-wmf.6|1.45.0-wmf.5}} * group1 to [[mw:MediaWiki_1.45/wmf.6|1.45.0-wmf.6]] * '''Blockers: {{phabricator|T392176}}''' }} {{Deployment calendar event card |when=2025-06-18 13:00 SF |length=1 |window=[[Backport windows|UTC late backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|RoanKattouw|Roan}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}}, {{ircnick|kindrobot|Stef}}, {{ircnick|cjming|Clare}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-18 14:00 SF |length=1 |window=Wikifunctions Services UTC Late |who=Abstract Wikipedia team (North and South America) |what=Wikifunctions back-end k8s services }} {{Deployment calendar event card |when=2025-06-18 15:00 SF |length=1 |window=Web Team deployment window |who=Web Team |what=NOTE: often skipped, the web team does not typically check IRC so assume this is not being used if 5 minutes past the start }} {{Deployment calendar event card |when=2025-06-18 23:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC early) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-18 23:00 SF |length=0.5 |window=Primary database switchover |who={{ircnick|marostegui|Manuel Arostegui}}, {{ircnick|Amir1|Amir}}, {{ircnick|federico3|Federico Ceratto}} |what=Deployment window held for primary database maintenance }} ==={{Deployment_day|date=2025-06-19}}=== {{Deployment calendar event card |when=2025-06-19 00:00 SF |length=1 |window=[[Backport windows|UTC morning backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Amir1|Amir}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|awight|Adam}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-19 03:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC mid-day) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-19 05:00 SF |length=1 |window=Mobileapps/RESTBase/Wikifeeds |who=Content Transform Team |what=Content transform team node services (mobileapps/wikifeeds) }} {{Deployment calendar event card |when=2025-06-19 06:00 SF |length=1 |window=[[Backport windows|UTC afternoon backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|Lucas_WMDE|Lucas}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-19 08:00 SF |length=1 |window=Train log triage |who={{ircnick|thcipriani|Tyler}} |what=See [[Heterogeneous_deployment/Train_deploys#Breakage]] }} {{Deployment calendar event card |when=2025-06-19 09:00 SF |length=1 |window=[[Puppet request window]]<br/><small>'''(Max 6 patches)'''</small> |who={{ircnick|jhathaway|JHathaway}}, {{ircnick|moritzm|Moritz}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to Puppet change'' }} {{Deployment calendar event card |when=2025-06-19 10:00 SF |length=1 |window=Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) |who={{ircnick|bd808}} |what=... }} {{Deployment calendar event card |when=2025-06-19 10:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC late) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. 
}} {{Deployment calendar event card |when=2025-06-19 11:00 SF |length=2 |window=MediaWiki train - UTC-7 Version |who={{ircnick|thcipriani|Tyler}} |what=[[mw:MediaWiki 1.45/Roadmap#Schedule for the deployments|1.45 schedule]] {{DeployOneWeekMini|1.45.0-wmf.6|1.45.0-wmf.6|1.45.0-wmf.5->1.45.0-wmf.6}} * group2 to [[mw:MediaWiki_1.45/wmf.6|1.45.0-wmf.6]] * '''Blockers: {{phabricator|T392176}}''' }} {{Deployment calendar event card |when=2025-06-19 13:00 SF |length=1 |window=[[Backport windows|UTC late backport window]]<br/><small>'''Your patch may or may not be deployed at the sole discretion of the deployer'''</small> |who={{ircnick|RoanKattouw|Roan}}, {{ircnick|Urbanecm|Martin}}, {{ircnick|TheresNoTime|Sammy}}, {{ircnick|kindrobot|Stef}}, {{ircnick|cjming|Clare}} |what= {{ircnick|irc-nickname|Requesting Developer}} * ''Gerrit link to backport or config change'' }} {{Deployment calendar event card |when=2025-06-19 14:00 SF |length=1 |window=Web Team deployment window |who=Web Team |what=NOTE: This window is often skipped. The Web Team does not typically check IRC, so assume the window is unused if nothing has started 5 minutes after the start time. }} {{Deployment calendar event card |when=2025-06-19 23:00 SF |length=1 |window=[[MediaWiki_On_Kubernetes#How_to_manage_changes_to_the_infrastructure|MediaWiki infrastructure]] (UTC early) |who=SRE team |what=MediaWiki-related infrastructure changes that need a kubernetes deployment. }} ==={{Deployment_day|date=2025-06-20}}=== {{Deployment calendar event card |when=2025-06-20 00:00 SF |length=24 |window=No deploys all day! See [[Deployments/Emergencies]] if things are broken. 
|who= |what=No Deploys }} {{Deployment calendar event card |when=2025-06-20 04:00 SF |length=0.5 |window=GitLab version upgrades |who={{ircnick|jelto|Jelto}}, {{ircnick|arnoldokoth|Arnold}}, {{ircnick|mutante|Daniel}} |what=GitLab version upgrades }} ==={{Deployment_day|date=2025-06-21}}=== {{Deployment calendar event card |when=2025-06-21 00:00 SF |length=24 |window=No deploys all day! See [[Deployments/Emergencies]] if things are broken. |who= |what=No Deploys }} Server Admin Log 0 7919 2309625 2309619 2025-06-08T12:04:59Z Stashbot 7414 Ammar: Ran fixStuckGlobalRename.php for T396290 and T396291 2309625 wikitext text/x-wiki == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye
* 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage
* 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage
* 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye
* 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002
* 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm
* 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
* 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
* 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage
* 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage
* 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm
* 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244']
* 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244']
* 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002
* 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002
* 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002
* 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins
* 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]]
* 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet
* 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet
* 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:24 sukhe@dns1004: END - running authdns-update
* 14:23 sukhe@dns1004: START - running authdns-update
* 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox)
* 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org
* 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox)
* 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org
* 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org
* 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox)
* 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org
* 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox)
* 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox)
* 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org
* 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox)
* 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org
* 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org
* 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox)
* 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org
* 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox)
* 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox)
* 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org
* 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox)
* 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org
* 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003"
* 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003"
* 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org
* 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox)
* 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org
* 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox)
* 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox)
* 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org
* 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox)
* 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org
* 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox
* 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org
* 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox)
* 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org
* 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox)
* 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002
* 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002
* 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet
* 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
* 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
* 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications
* 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
* 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply
* 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up
* 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up
* 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044
* 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044
* 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997
* 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997
* 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065
* 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065
* 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562
* 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562
* 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150
* 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150
* 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524
* 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524
* 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199
* 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s)
* 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync
* 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]]
* 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s)
* 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0
* 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots
* 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s)
* 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync
* 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json
* 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]]
* 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json
* 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance
* 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 12:21 reedy@deploy1003: reedy: Continuing with sync
* 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}}
* 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}}
* 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json
* 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json
* 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json
* 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s)
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json
* 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm
* 11:51 samtar@deploy1003: samtar: Continuing with sync
* 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]]
* 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json
* 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json
* 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply
* 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply
* 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet
* 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply
* 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply
* 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet
* 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json
* 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance
* 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json
* 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json
* 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
* 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:20 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet
* 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json
* 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3
* 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json
* 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance
* 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json
* 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json
* 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json
* 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json
* 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
* 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005
* 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json
* 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet
* 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
* 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet
* 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json
* 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet
* 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet
* 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json
* 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet
* 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet
* 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json
* 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json
* 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet
* 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
* 10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json
* 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json
* 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json
* 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]]
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json
* 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]]
* 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json
* 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json
* 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
* 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around.
* 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
* 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json
* 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet
* 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
* 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet
* 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json
* 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
* 09:38 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json
* 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet
* 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet
* 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json
* 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json
* 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json
* 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]]
* 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json
* 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3
* 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json
* 09:20 marostegui: Move datadir on pc2011 dbmaint pc1 codfw [[phab:T395983|T395983]]
* 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance
* 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json
* 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json
* 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json
* 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s.
* 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx.
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
* 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json
* 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply
* 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance
* 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json
* 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply
* 09:10 moritzm: installing qemu bugfix updates from Bookworm point release
* 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw
* 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json
* 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw
* 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply
* 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json
* 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json
* 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance
* 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply
* 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply
* 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json
* 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply
* 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply
* 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm
* 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json
* 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
* 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json
* 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]]
* 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
* 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json
* 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
* 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
* 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]]
* 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]]
* 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001
* 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json
* 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]]
* 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json
* 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json
* 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]]
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json
* 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance
* 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json
* 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries)
* 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json
* 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org
* 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org
* 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json
* 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json
* 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json
* 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
* 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
* 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json
* 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain
* 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain
* 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json
* 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain
* 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain
* 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json
* 07:23 Emperor: restart swift-object-replicator ms-be2066
* 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json
* 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain
* 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of doh7001.wikimedia.org to plain
* 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain
* 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain
* 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain
* 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json
* 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain
* 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json
* 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet
* 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet
* 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json
* 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json
* 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json
* 06:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json
* 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s)
* 06:24 marostegui@deploy1003: marostegui: Continuing with sync
* 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json
* 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]]
* 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json
* 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json
* 06:03 marostegui@dns1006: END - running authdns-update
* 06:03 marostegui@dns1006: START - running authdns-update
* 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply
* 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply
* 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply
* 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s)
* 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
* 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync
* 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]]
* 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s)
* 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync
* 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]]
* 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s)
* 20:29 arlolra@deploy1003: arlolra: Continuing with sync
* 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
* 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
* 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]]
* 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s)
* 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync
* 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]]
* 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker1028.eqiad.wmnet
* 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox
* 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet
* 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be)
* 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json
* 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia
* 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia
* 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s)
* 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6
* 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json
* 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet
* 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
* 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
* 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox
* 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json
* 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet
* 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json
* 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json
* 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance
* 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance
* 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json
* 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json
* 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet
* 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet
* 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json
* 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json
* 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json
* 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance
* 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json
* 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json
* 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json
* 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn"
* 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json
* 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json
* 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}]
* 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to /var/cache/conftool/dbconfig/20250602-155946-root.json
* 15:55 sukhe: enable puppet and run agent on cp7001
* 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json
* 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}]
* 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}}
* 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s)
* 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json
* 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
* 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json
* 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json
* 15:42 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 15:42 phuedx@deploy1003: phuedx: Continuing with sync
* 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]]
* 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json
* 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json
* 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s)
* 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552]
* 15:21 thcipriani: jouncebot nowandnext
* 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json
* 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye
* 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f]
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c]
* 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json
* 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json
* 14:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
* 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json
* 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s)
* 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f]
* 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s)
* 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f]
* 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s)
* 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f]
* 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json
* 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json
* 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json
* 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json
* 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
* 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json
* 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet
* 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm
* 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json
* 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 13:34 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json
* 13:24 Lucas_WMDE: UTC afternoon backport+config window done
* 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
* 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s)
* 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]]
* 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json
* 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json
* 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning
* 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org
* 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json
* 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org
* 12:26 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors
* 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json
* 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet
* 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org
* 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm
* 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]]
* 12:11 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet
* 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render
* 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json
* 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json
* 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet
* 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet
* 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
* 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>
== 2025-06-09 ==
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for
release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json == Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude>
== 2025-06-09 == * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey:
restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
* 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json
* 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet
* 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json
* 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
* 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]]
* 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json
* 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet
* 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]]
* 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet
* 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json
* 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json
* 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json
* 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json
* 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance
* 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json
* 06:12 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json
* 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance
* 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json
* 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json
* 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json
* 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance
* 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json
* 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json
* 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance
* 05:51 marostegui: Change datadir on pc5 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json
* 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json
* 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance
* 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json
* 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json
* 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json
* 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance
* 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json
* 05:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from {{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}}
* 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s)
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet
* 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10
* 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync
* 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]]
* 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet
* 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet
* 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet
* 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet
* 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet
* 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet
* 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet
* 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet
* 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet
* 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet
* 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet
* 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet)
* 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet
* 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet
* 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet
* 21:04 cjming: end of UTC late backport window
* 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet
* 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d
* 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync
* 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 20:54 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to
* 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]]
* 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet
* 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet
* 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet
* 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet
* 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s)
* 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet
* 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync
* 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet
* 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using
stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bullseye
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye
* 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage
* 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage
* 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage
* 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage
* 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage
* 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage
* 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes
* 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye
* 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye
* 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s)
* 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002
* 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002
* 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet
* 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]]
* 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet
* 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s)
* 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet with OS bullseye
* 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye
* 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync
* 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]]
* 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s)
* 20:45 jsn@deploy1003: jsn: Continuing with sync
* 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]]
* 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s)
* 20:29 arlolra@deploy1003: arlolra: Continuing with sync
* 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
* 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
* 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]]
* 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s)
* 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync
* 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]]
* 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker1028.eqiad.wmnet
* 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox
* 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet
* 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be)
* 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json
* 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia
* 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia
* 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s)
* 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6
* 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json
* 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet
* 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
* 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
* 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox
* 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json
* 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet
* 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json
* 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json
* 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance
* 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance
* 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json
* 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json
* 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet
* 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet
* 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json
* 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json
* 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json
* 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance
* 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json
* 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json
* 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json
* 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn"
* 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json
* 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json
* 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}]
* 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to /var/cache/conftool/dbconfig/20250602-155946-root.json
* 15:55 sukhe: enable puppet and run agent on cp7001
* 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json
* 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}]
* 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}}
* 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s)
* 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json
* 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
* 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json
* 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json
* 15:42 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 15:42 phuedx@deploy1003: phuedx: Continuing with sync
* 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]]
* 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json
* 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json
* 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s)
* 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552]
* 15:21 thcipriani: jouncebot nowandnext
* 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json
* 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye
* 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f]
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c]
* 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json
* 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json
* 14:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
* 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json
* 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s)
* 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f]
* 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s)
* 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f]
* 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s)
* 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f]
* 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json
* 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json
* 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json
* 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json
* 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
* 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json
* 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet
* 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm
* 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json
* 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 13:34 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json
* 13:24 Lucas_WMDE: UTC afternoon backport+config window done
* 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
* 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s)
* 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS
(hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet
* 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply
* 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply
* 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet
* 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json
* 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance
* 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json
* 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json
* 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
* 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:20
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS
(hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json == 2025-06-04 == * 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and 
[[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto 
db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: 
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - 
Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 
14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]]
* 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s)
* 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 16:35 bvibber@deploy1003: bvibber: Continuing with sync
* 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:34 sukhe@dns1004: END - running authdns-update
* 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run
* 16:33 sukhe@dns1004: START - running authdns-update
* 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]]
* 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s)
* 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs
* 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning
* 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]]
* 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to
then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet
* 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply
* 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply
* 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet
* 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json
* 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance
* 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json
* 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json
* 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
* 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:20 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet
* 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json
* 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3
* 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json
* 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance
* 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json
* 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json
* 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json
* 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json
* 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]]
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json
* 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it 
to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from {{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}}
* 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s)
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet
* 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10
* 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync
* 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude>
== 2025-06-09 == * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to
/var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook 
sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host 
db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) 
- Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - 
Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: 
helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning *
06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]]
* 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json
* 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json
* 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning
* 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org
* 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json
* 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org
* 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

5cdpqhtxk0xfo3vogiasl67ux4hi8co 2309653 2309651 2025-06-09T07:56:20Z Stashbot 7414 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json 2309653 wikitext text/x-wiki

== 2025-06-09 ==
* 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json
* 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json
* 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json
* 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning
* 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]]
* 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart
apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 Lucas_WMDE: UTC afternoon backport+config window done
* 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json
* 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
* 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s)
* 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json
* 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json
* 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync
* 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]]
* 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json
* 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json
* 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
* 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json
* 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json
* 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
* 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json
* 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json
* 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
* 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json
* 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
* 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json
* 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json
* 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance
* 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
* 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json
* 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet
* 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json
* 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
* 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]]
* 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json
* 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with
reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) 
for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from {{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}}
* 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s)
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet
* 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10
* 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync
* 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * 12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3 * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: 
DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS 
bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
* 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json
* 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet
* 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json
* 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
* 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]]
* 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json
* 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet
* 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]]
* 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet
* 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json
* 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json
* 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json
* 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json
* 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance
* 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json
* 06:12 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json
* 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance
* 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json
* 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json
* 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json
* 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance
* 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json
* 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json
* 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance
* 05:51 marostegui: Change datadir on pc5 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json
* 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json
* 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance
* 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json
* 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json
* 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json
* 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance
* 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json
* 05:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from {{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}}
* 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s)
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet
* 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10
* 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync
* 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]]
* 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet
* 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet
* 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet
* 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet
* 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet
* 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet
* 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet
* 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet
* 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet
* 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet
* 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet
* 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet)
* 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet
* 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet
* 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet
* 21:04 cjming: end of UTC late backport window
* 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet
* 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d
* 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync
* 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 20:54 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to
* 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]]
* 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet
* 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet
* 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet
* 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet
* 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s)
* 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet
* 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync
* 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet
*
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169
([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook 
sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s)
* 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]]
* 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox
* 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
* 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
* 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
* 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
* 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
* 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
* 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
* 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
* 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
* 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s)
* 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
* 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet
* 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 14:55 cmooney@dns2005: END - running authdns-update
* 14:54 cmooney@dns2005: START - running authdns-update
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:37 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json
* 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s)
* 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
* 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s)
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
* 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json
* 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s)
* 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}}
* 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye
* 13:33 samtar@deploy1003: samtar: Continuing with sync
* 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}}
* 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]]
* 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}}
* 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json
* 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json
* 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]]
* 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"'
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json
* 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json
* 13:14 jforrester@deploy1003: Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s)
* 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json
* 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
* 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
* 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync
* 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia
* 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]]
* 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json
* 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply
* 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
* 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json
* 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json
* 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet
* 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet
* 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
* 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
* 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:36 moritzm: installing modsecurity-apache security updates
* 12:36 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json
* 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json
* {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}}
* 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json
* 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json
* 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json
* 12:23 marostegui@cumin1002:
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json
* 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]]
* 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]]
* 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org
* 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json
* 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json
* 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json
* 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json
* 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet
* 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet
* 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json
* 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json
* 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json
* 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning
* 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json
* 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json
* 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to /var/cache/conftool/dbconfig/20250603-121824-fceratto.json
* 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s)
* 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
* 12:09 marostegui@deploy1003: marostegui: Continuing with sync
* 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' .
* 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]]
* 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]]
* 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' .
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json
* 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json
* 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json
* 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' .
* 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
* 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
* 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
* 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
* 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
* 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
* 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json
* 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json
* 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]]
* 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json
* 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json
* 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]]
* 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
* 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet
* 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json
* 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json
* 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json
* 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s)
* 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet
* 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
* 11:32 marostegui@deploy1003: marostegui: Continuing with sync
* 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]]
* 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]]
* 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning
* 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3
* 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]]
* 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json
* 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}}
* 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json
* 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet
* 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet
* 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet
* 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to /var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 [[phab:T387504|T387504]]
* 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm
* 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage
* 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage
* 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore
* 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning
* 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json
* 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json
* 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json
* 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
* 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json
* 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json
* 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
* 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json
* 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master
* 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet
* 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
* 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
* 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org
* 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm
* 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json
* 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
* 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
* 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage
* 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json
* 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s)
* 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye
* 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage
* 07:18 tchanders@deploy1003: tchanders: Continuing with sync
* 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]]
* 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json
* 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json
* 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json
* 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
* 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json
* 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm
* 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003"
* 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003"
* 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors
* 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors
* 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003"
* 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003"
* 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye
* 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json
* 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org
* 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json
* 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json
* 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]]
* 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json
* 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]]
* 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to /var/cache/conftool/dbconfig/20250603-063004-fceratto.json
* 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]]
* 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json
* 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json
* 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json
* 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json
* 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json
* 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
* 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json
* 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to /var/cache/conftool/dbconfig/20250603-055628-root.json
* 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json
* 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json
* 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s)
* 05:32 marostegui@deploy1003: marostegui: Continuing with sync
* 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json
* 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]]
* 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json
* 05:27 marostegui@dns1006: END - running authdns-update
* 05:26 marostegui@dns1006: START - running authdns-update
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json
* 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]]
* 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json
* 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s)
* 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]]
* 05:14 marostegui@deploy1003: marostegui: Continuing with sync
* 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling
after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 
marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to 
db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>
s7i8qwcmpguu42gy9hy5ai7pwg7q9tm 2309664 2309663 2025-06-09T08:26:57Z Stashbot 7414 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 (T396130)', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json 2309664 wikitext text/x-wiki
== 2025-06-09 ==
* 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json
* 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json
* 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json
* 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet
* 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0)
db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json
* 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s)
* 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}}
* 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye
* 13:33 samtar@deploy1003: samtar: Continuing with sync
* 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}}
* 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]]
* 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}}
* 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json
* 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json
* 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]]
* 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"'
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json
* 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json
* 13:14 jforrester@deploy1003: Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s)
* 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json
* 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
* 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
* 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync
* 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia
* 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]]
* 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json
* 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply
* 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
* 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json
* 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json
* 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet
* 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet
* 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
* 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
* 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:36 moritzm: installing modsecurity-apache security updates
* 12:36 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json
* 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json
* {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}}
* 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json
* 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json
* 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json
* 12:23 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json
* 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance
* 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 12:21 reedy@deploy1003: reedy: Continuing with sync
* 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}}
* 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}}
* 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json
* 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json
* 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json
* 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s)
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json
* 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm
* 11:51 samtar@deploy1003: samtar: Continuing with sync
* 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]]
* 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json
* 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json
* 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply
* 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply
* 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet
* 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply
* 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply
* 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet
* 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json
* 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance
* 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json
* 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json
* 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet
* 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply
* 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply
* 11:20 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet
* 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json
* 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3
* 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json
* 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance
* 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json
* 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json
* 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
* 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
* 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json
* 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json
* 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
* 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005
* 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json
* 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet
* 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
* 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet
* 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json
* 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet
* 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet
* 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json
* 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet
* 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet
* 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json
* 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json
* 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet
* 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet
* 10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json
* 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json
* 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json
* 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]]
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json
* 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]]
* 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
* 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json
* 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json
* 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
* 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around.
* 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
* 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json
* 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet
* 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
* 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet
* 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json
* 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
* 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json
* 13:24 Lucas_WMDE: UTC afternoon backport+config window done
* 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
* 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s)
* 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

4050vakwvbwpmk1raui5b4xxw82riqd 2309665 2309664 2025-06-09T08:29:47Z Stashbot 7414 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 (T396130)', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json 2309665 wikitext text/x-wiki

== 2025-06-09 ==
* 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json
* 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json
* 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json
* 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config
saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool 
db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==

* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to
/var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 
steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json
* 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json
* 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json
* 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning
* 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json
* 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json
* 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json
* 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s)
* 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
* 12:09 marostegui@deploy1003: marostegui: Continuing with sync
* 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' .
* 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]]
* 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]]
* 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' .
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json
* 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json
* 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json
* 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' .
* 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
* 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' .
* 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' .
* 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' .
* 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
* 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
* 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json
* 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json
* 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]]
* 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json
* 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json
* 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]]
* 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
* 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet
* 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json
* 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json
* 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json
* 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s)
* 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet
* 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
* 11:32 marostegui@deploy1003: marostegui: Continuing with sync
* 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]]
* 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]]
* 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning
* 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3
* 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]]
* 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json
* 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}}
* 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json
* 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet
* 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet
* 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet
* 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]]
* 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm
* 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage
* 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage
* 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore
* 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning
* 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json
* 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json
* 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json
* 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
* 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json
* 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json
* 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
* 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json
* 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master
* 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet
* 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors
* 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
* 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org
* 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm
* 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json
* 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet.
on all recursors
* 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors
* 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage
* 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json
* 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s)
* 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye
* 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage
* 07:18 tchanders@deploy1003: tchanders: Continuing with sync
* 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]]
* 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json
* 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json
* 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json
* 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
* 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json
* 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm
* 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003"
* 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003"
* 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors
* 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors
* 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003"
* 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003"
* 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye
* 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json
* 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org
* 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json
* 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json
* 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]]
* 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json
* 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]]
* 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json
* 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]]
* 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json
* 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json
* 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json
* 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json
* 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json
* 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
* 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json
* 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to
/var/cache/conftool/dbconfig/20250603-055628-root.json
* 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json
* 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json
* 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s)
* 05:32 marostegui@deploy1003: marostegui: Continuing with sync
* 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json
* 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]]
* 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json
* 05:27 marostegui@dns1006: END - running authdns-update
* 05:26 marostegui@dns1006: START - running authdns-update
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json
* 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]]
* 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json
* 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s)
* 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]]
* 05:14 marostegui@deploy1003: marostegui: Continuing with sync
* 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]]
* 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]]
* 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]]
* 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json
* 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json
* 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]]
* 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]]
* 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json
* 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json
* 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet
* 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> ant5yaif0dcd0g5am9secfflr0i4nud 2309667 2309666 2025-06-09T09:00:00Z Stashbot 7414 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json 2309667 wikitext text/x-wiki == 2025-06-09 == * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: 
Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 
2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]]
* 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s)
* 20:29 arlolra@deploy1003: arlolra: Continuing with sync
* 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
* 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
* 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]]
* 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s)
* 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync
* 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>
== 2025-06-09 ==
* 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json
* 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json
* 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json
* 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json
* 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180
([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 months - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives== See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> 2309670 2309669 2025-06-09T09:15:23Z Stashbot 7414 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance 2309670 wikitext text/x-wiki == 2025-06-09 == * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl 
commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet 
with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to
/var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 
07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and 
[[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json == Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config
saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to 
db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]]
* 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json
* 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json
* 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning
* 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org
* 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json
* 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org
* 12:26 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors
* 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json
* 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet
* 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org
* 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm
* 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]]
* 12:11 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet
* 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render
* 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json
* 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json
* 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet
* 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet
* 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
* 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - 
marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]]
* 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet
* 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet
* 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet
* 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet
* 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet
* 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet
* 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet
* 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet
* 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet
* 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet
* 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet
* 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet)
* 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet
* 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet
* 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet
* 21:04 cjming: end of UTC late backport window
* 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet
* 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d
* 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync
* 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 20:54 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to
* 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]]
* 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet
* 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet
* 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet
* 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet
* 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s)
* 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet
* 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync
* 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet
* 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet
* 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]]
* 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet
* 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet
* 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s)
* 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet
* 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet
* 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync
* 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]]
* 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet
* 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet
* 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet
* 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet
* 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 19:13 sukhe@dns1004: END - running authdns-update
* 19:12 sukhe@dns1004: START - running authdns-update
* 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot]
* 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot]
* 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org
* 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org
* 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org
* 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
* 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply
* 19:10 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s)
* 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
* 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
* 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
* 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply
* 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s)
* 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]]
* 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox
* 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
* 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
* 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
* 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
* 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
* 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
* 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
* 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
* 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
* 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s)
* 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
* 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet
* 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 14:55 cmooney@dns2005: END - running authdns-update
* 14:54 cmooney@dns2005: START - running authdns-update
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:37 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json
* 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s)
* 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
* 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s)
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
* 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json
* 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s)
* 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}}
* 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye
* 13:33 samtar@deploy1003: samtar: Continuing with sync
* 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]]
* 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s)
* 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 16:35 bvibber@deploy1003: bvibber: Continuing with sync
* 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:34 sukhe@dns1004: END - running authdns-update
* 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run
* 16:33 sukhe@dns1004: START - running authdns-update
* 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]]
* 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s)
* 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs
* 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - 
marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s)
* 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0
* 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots
* 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s)
* 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync
* 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json
* 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]]
* 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 Lucas_WMDE: UTC afternoon backport+config window done
* 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json
* 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
* 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s)
* 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json
* 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json
* 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync
* 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]]
* 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json
* 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json
* 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
* 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json
* 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json
* 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
* 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json
* 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json
* 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
* 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json
* 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
* 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json
* 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json
* 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance
* 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 09:28 tappof@dns1004: END - running authdns-update
* 09:27 tappof@dns1004: START - running authdns-update
* 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]]
* 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json
* 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json
* 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance
* 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json
* 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and
previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to 
db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json
* 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]]
* 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 Lucas_WMDE: UTC afternoon backport+config window done
* 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json
* 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
* 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s)
* 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json
* 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json
* 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync
* 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]]
* 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json
* 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json
* 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
* 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json
* 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json
* 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
* 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json
* 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json
* 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
* 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json
* 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
* 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json
* 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json
* 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance
* 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
* 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json
* 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet
* 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json
* 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
* 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]]
* 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json
* 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to
/var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - 
Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage == 2025-06-02 == * 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> gwu591xc03rncihh6p2m59tqv3aj71p 2309680 2309677 2025-06-09T09:33:24Z Stashbot 7414 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json 2309680 wikitext text/x-wiki == 2025-06-09 == * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool 
db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookwork-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json
* 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]]
* 09:28 tappof@dns1004: END - running authdns-update
* 09:27 tappof@dns1004: START - running authdns-update
* 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]]
* 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json
* 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to
/var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to 
https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - 
Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host 
db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) 
- Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - 
Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: 
helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15
marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 
steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 15:42 phuedx@deploy1003: phuedx: Continuing with sync
* 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]]
* 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json
* 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json
* 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s)
* 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...)
[airflow-dags/analytics_test@03db0552]
* 15:21 thcipriani: jouncebot nowandnext
* 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json
* 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye
* 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f]
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c]
* 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json
* 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json
* 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json == Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> nahfojca1idty3tc1ea0i2c9e6zvi30 2309684 2309682 2025-06-09T09:42:57Z Stashbot 7414 marostegui: Migrate s2 eqiad dbmaint to SBR T383795 2309684 wikitext text/x-wiki == 2025-06-09 == * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 
marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 
steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004:
START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet 
with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> o1omlsx8x6fq7h2cf07960mjveszflr 2309690 2309687 2025-06-09T10:03:41Z Stashbot 7414 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 (T396130)', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json 2309690 wikitext text/x-wiki == 2025-06-09 == * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config 
saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ 
using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 months - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> mi54sdlc4407yrk6iugc5m8alq6k3rk 2309691 2309690 2025-06-09T10:03:55Z Stashbot 7414 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance 2309691 wikitext text/x-wiki == 2025-06-09 == * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to 
https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]]
* 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet
* 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet
* 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet
* 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet
* 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 19:13 sukhe@dns1004: END - running authdns-update
* 19:12 sukhe@dns1004: START - running authdns-update
* 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot]
* 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot]
* 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org
* 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org
* 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org
* 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
* 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply
* 19:10 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s)
* 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
* 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
* 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
* 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply
* 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s)
* 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]]
* 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox
* 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
* 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
* 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
* 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
* 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
* 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
* 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
* 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
* 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
* 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s)
* 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
* 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet
* 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 14:55 cmooney@dns2005: END - running authdns-update
* 14:54 cmooney@dns2005: START - running authdns-update
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:37 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json
* 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s)
* 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
* 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s)
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
* 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json
* 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s)
* 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}}
* 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye
* 13:33 samtar@deploy1003: samtar: Continuing with sync
* 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}}
* 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]]
* 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}}
* 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json
* 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json
* 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]]
* 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"'
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json
* 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json
* 13:14 jforrester@deploy1003: Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s)
* 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json
* 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
* 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
* 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync
* 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia
* 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]]
* 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json
* 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply
* 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
* 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json
* 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json
* 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet
* 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet
* 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
* 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
* 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:36 moritzm: installing modsecurity-apache security updates
* 12:36 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json
* 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json
* {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}}
* 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json
* 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json
* 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json
* 12:23 marostegui@cumin1002:
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto 
db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: 
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - 
Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 
14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage == 2025-06-02 == * 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye
* 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye
* 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync
* 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]]
* 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s)
* 20:45 jsn@deploy1003: jsn: Continuing with sync
* 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]]
* 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s)
* 20:29 arlolra@deploy1003: arlolra: Continuing with sync
* 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
* 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
* 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]]
* 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s)
* 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync
* 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin 
(exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 
([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook 
sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: 
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - 
Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 
14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
==Archives ==
See [[Server Admin Log/Archives]].
<noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude>
== 2025-06-09 ==
* 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42
marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to 
https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook 
sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to
https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - 
Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling 
after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 
marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit 
(dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]]
* 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s)
* 20:29 arlolra@deploy1003: arlolra: Continuing with sync
* 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
* 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
* 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]]
* 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s)
* 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync
* 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 
05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host 
db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) 
- Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - 
Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: 
helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye
* 22:08 maryum: scap sync-world finished, deploying fixes for several security bugs and PrivateSettings.php changes
* 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye
* 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye
* 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s)
* 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002
* 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002
* 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet
* 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]]
* 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet
* 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s)
* 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]]
* 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s)
* 20:29 arlolra@deploy1003: arlolra: Continuing with sync
* 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
* 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
* 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]]
* 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s)
* 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync
* 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]]
* 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker1028.eqiad.wmnet
* 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox
* 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet
* 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be)
* 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json
* 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia
* 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia
* 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s)
* 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6
* 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json
* 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet
* 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
* 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
* 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox
* 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json
* 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet
* 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json
* 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json
* 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance
* 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance
* 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json
* 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json
* 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet
* 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet
* 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json
* 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json
* 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json
* 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance
* 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json
* 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json
* 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json
* 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn"
* 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json
* 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json
* 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}]
* 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to /var/cache/conftool/dbconfig/20250602-155946-root.json
* 15:55 sukhe: enable puppet and run agent on cp7001
* 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json
* 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}]
* 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}}
* 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s)
* 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json
* 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
* 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json
* 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json
* 15:42 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 15:42 phuedx@deploy1003: phuedx: Continuing with sync
* 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]]
* 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json
* 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json
* 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s)
* 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552]
* 15:21 thcipriani: jouncebot nowandnext
* 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json
* 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye
* 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f]
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c]
* 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json
* 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json
* 14:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
* 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json
* 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s)
* 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f]
* 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s)
* 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f]
* 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s)
* 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f]
* 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json
* 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json
* 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json
* 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json
* 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
* 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json
* 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet
* 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm
* 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json
* 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 13:34 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json
* 13:24 Lucas_WMDE: UTC afternoon backport+config window done
* 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
* 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s)
* 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]]
* 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json
* 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json
* 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning
* 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org
* 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json
* 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org
* 12:26 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors
* 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json
* 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet
* 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org
* 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm
* 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]]
* 12:11 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet
* 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render
* 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json
* 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json
* 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet
* 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet
* 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
* 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json
* 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json
* 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json
* 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet
* 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance
* 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on
db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl 
commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 
gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning
* 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]]
* 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync'
command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye
* 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage
* 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage
* 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye
* 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002
* 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm
* 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
* 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
* 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage
* 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage
* 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm
* 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244']
* 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244']
* 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002
* 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002
* 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002
* 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins
* 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]]
* 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet
* 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet
* 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:24 sukhe@dns1004: END - running authdns-update
* 14:23 sukhe@dns1004: START - running authdns-update
* 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox)
* 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org
* 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox)
* 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org
* 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org
* 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox)
* 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org
* 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox)
* 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox)
* 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org
* 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox)
* 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org
* 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org
* 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox)
* 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org
* 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox)
* 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox)
* 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org
* 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox)
* 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org
* 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003"
* 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003"
* 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org
* 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox)
* 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org
* 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox)
* 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox)
* 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org
* 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox)
* 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org
* 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox
* 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org
* 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox)
* 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org
* 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox)
* 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002
* 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002
* 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet
* 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
* 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
* 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications
* 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
* 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply
* 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up
* 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up
* 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044
* 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044
* 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997
* 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997
* 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065
* 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065
* 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562
* 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562
* 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150
* 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150
* 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524
* 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524
* 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199
* 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s)
* 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync
* 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]]
* 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s)
* 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0
* 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots
* 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s)
* 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync
* 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json
* 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]]
* 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
*
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]]
* 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json
* 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json
* 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json
* 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet
* 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance
* 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet
with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 
([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually 
with 4 steps - Pool db2244.codfw.wmnet in after cloning
* 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]]
* 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on
namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * 12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3 * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running 
authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet 
with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 Lucas_WMDE: UTC afternoon backport+config window done
* 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json
* 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
* 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s)
* 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json
* 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json
* 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync
* 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]]
* 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json
* 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json
* 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
* 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json
* 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json
* 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
* 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json
* 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json
* 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
* 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json
* 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
* 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json
* 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json
* 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance
* 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
* 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json
* 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet
* 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json
* 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
* 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]]
* 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json
* 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous 
config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved 
to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ 
using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 
'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to 
https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]]
* 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s)
* 20:29 arlolra@deploy1003: arlolra: Continuing with sync
* 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
* 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
* 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]]
* 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s)
* 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync
* 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]]
* 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica
* 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker1028.eqiad.wmnet
* 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox
* 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet
* 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be)
* 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json
* 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia
* 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia
* 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s)
* 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6
* 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json
* 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet
* 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
* 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002"
* 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox
* 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json
* 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet
* 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json
* 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json
* 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance
* 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance
* 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json
* 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json
* 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet
* 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet
* 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json
* 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json
* 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json
* 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance
* 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json
* 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json
* 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json
* 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn"
* 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json
* 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json
* 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}]
* 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to /var/cache/conftool/dbconfig/20250602-155946-root.json
* 15:55 sukhe: enable puppet and run agent on cp7001
* 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json
* 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}]
* 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}}
* 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s)
* 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json
* 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance
* 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json
* 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json
* 15:42 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage
* 15:42 phuedx@deploy1003: phuedx: Continuing with sync
* 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]]
* 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json
* 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json
* 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s)
* 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552]
* 15:21 thcipriani: jouncebot nowandnext
* 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json
* 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye
* 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f]
* 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s)
* 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c]
* 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json
* 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json
* 14:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
* 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json
* 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s)
* 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f]
* 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s)
* 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f]
* 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s)
* 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f]
* 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json
* 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json
* 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json
* 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json
* 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
* 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json
* 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet
* 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm
* 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json
* 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
* 13:34 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json
* 13:24 Lucas_WMDE: UTC afternoon backport+config window done
* 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
* 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s)
* 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]]
* 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json
* 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json
* 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning
* 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org
* 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json
* 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org
* 12:26 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors
* 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003"
* 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json
* 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet
* 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org
* 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm
* 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]]
* 12:11 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet
* 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render
* 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json
* 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json
* 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet
* 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet
* 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
* 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude>
== 2025-06-09 == * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214',
diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook 
sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous 
config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran 
fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: 
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - 
Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 
14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up
* 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up
* 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json
* 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json
* 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]]
* 
10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl 
commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to 
https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - 
marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up
* 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up
* 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there.
* 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]]
* 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json
* 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json
* 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply
* 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply
* 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json
* 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json
* 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json
* 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]]
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json
* 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad 
dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to 
https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook 
sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>
== 2025-06-09 ==
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet
* 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json
* 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json
* 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance
* 10:16 vgutierrez: repooling lvs1013
handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to 
https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - 
Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>
== 2025-06-09 ==
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet
* 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json
* 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json
* 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on
db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and 
previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool 
(exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there.
* 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]]
* 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json
* 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json
* 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply
* 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply
* 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json
* 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json
* 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json
* 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]]
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json
* 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
* 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet
* 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json
* 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to
/var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: 
Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of 
db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved 
to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s)
* 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]]
* 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox
* 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
* 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
* 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
* 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
* 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
* 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
* 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
* 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
* 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
* 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s)
* 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
* 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet
* 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 14:55 cmooney@dns2005: END - running authdns-update
* 14:54 cmooney@dns2005: START - running authdns-update
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:37 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json
* 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s)
* 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
* 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s)
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
* 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json
* 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s)
* 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}}
* 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye
* 13:33 samtar@deploy1003: samtar: Continuing with sync
* 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}}
* 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]]
* 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}}
* 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json
* 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json
* 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]]
* 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"'
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json
* 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json
* 13:14 jforrester@deploy1003: Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s)
* 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json
* 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
* 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
* 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync
* 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia
* 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]]
* 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json
* 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply
* 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
* 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json
* 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json
* 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet
* 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet
* 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
* 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
* 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:36 moritzm: installing modsecurity-apache security updates
* 12:36 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json
* 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json
* {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}}
* 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json
* 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json
* 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json
* 12:23 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json
* 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance
* 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 12:21 reedy@deploy1003: reedy: Continuing with sync
* 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}}
* 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}}
* 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json
* 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json
* 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json
* 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s)
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json
* 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm
* 11:51 samtar@deploy1003: samtar: Continuing with sync
* 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]]
* 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json
* 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json
* 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply
* 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply
* 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookwork-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json == Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to
https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook 
sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled 
[[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 Lucas_WMDE: UTC afternoon backport+config window done
* 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json
* 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
* 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s)
* 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json
* 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json
* 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync
* 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]]
* 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json
* 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json
* 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
* 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json
* 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json
* 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
* 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json
* 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json
* 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
* 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json
* 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
* 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json
* 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json
* 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance
* 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet
* 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json
* 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet
* 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json
* 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet
* 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]]
* 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json
* 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet
* 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet
* 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]]
* 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s)
* 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 16:35 bvibber@deploy1003: bvibber: Continuing with sync
* 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:34 sukhe@dns1004: END - running authdns-update
* 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run
* 16:33 sukhe@dns1004: START - running authdns-update
* 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]]
* 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s)
* 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs
* 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]]
* 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s)
* 20:29 arlolra@deploy1003: arlolra: Continuing with sync
* 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028
* 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028
* 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]]
* 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s)
* 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]]
* 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync
* 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]]
* 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json
* 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
* 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json
* 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet
* 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning
* 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org
* 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json
* 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org
* 12:26 jmm@cumin1003: START - Cookbook
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>
== 2025-06-09 ==
* 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply
* 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet
* 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json
* 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:22 fceratto@cumin1002: dbctl commit (dc=all):
'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool 
db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up
* 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up
* 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044
* 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044
* 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997
* 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997
* 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065
* 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065
* 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562
* 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562
* 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150
* 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150
* 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524
* 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524
* 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199
* 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>
np36pcitljc17j7k201z5vmelfidvsy 2309714 2309713 2025-06-09T10:33:57Z Stashbot 7414 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply 2309714 wikitext text/x-wiki
== 2025-06-09 ==
* 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
* 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply
* 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet
* 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json
* 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 
05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host 
db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) 
- Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - 
Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: 
helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
* 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}

== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]]
* 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s)
* 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 16:35 bvibber@deploy1003: bvibber: Continuing with sync
* 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:34 sukhe@dns1004: END - running authdns-update
* 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run
* 16:33 sukhe@dns1004: START - running authdns-update
* 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]]
* 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s)
* 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs
* 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage == 2025-06-02 == * 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json
* 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
* 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply
* 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet
* 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet
* 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet
* 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to
https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: 
Maintenance
* 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning
* 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]]
* 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json == Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous
config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook 
sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile 
[aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude>
== 2025-06-09 == * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for
db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ 
using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually 
with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and 
A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 
([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json
* 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning
* 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]]
* 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook
sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye
* 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye
* 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync
* 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]]
* 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s)
* 20:45 jsn@deploy1003: jsn: Continuing with sync
* 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 
fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: 
dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', 
diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json
* 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]]
* 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s)
* 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]]
* 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox
* 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
* 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
* 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
* 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
* 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
* 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
* 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
* 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
* 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
* 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s)
* 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
* 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet
* 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 14:55 cmooney@dns2005: END - running authdns-update
* 14:54 cmooney@dns2005: START - running authdns-update
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:37 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json
* 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s)
* 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
* 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s)
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
* 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json
* 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s)
* 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}}
* 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye
* 13:33 samtar@deploy1003: samtar: Continuing with sync
* 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}}
* 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]]
* 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}}
* 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json
* 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json
* 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]]
* 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"'
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json
* 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json
* 13:14 jforrester@deploy1003: Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s)
* 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json
* 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
* 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
* 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync
* 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia
* 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]]
* 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json
* 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply
* 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
* 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json
* 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json
* 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet
* 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet
* 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
* 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
* 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:36 moritzm: installing modsecurity-apache security updates
* 12:36 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json
* 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json
* {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}}
* 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json
* 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json
* 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json
* 12:23 marostegui@cumin1002:
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - 
Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and 
A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 
([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook 
sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
* 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
* 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
* 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up
* 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up
* 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>
gqu4jkvnjsxxbvf8abiv1ubehwft9wo 2309723 2309722 2025-06-09T11:01:22Z Stashbot 7414 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 (T396130)', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json 2309723 wikitext text/x-wiki
== 2025-06-09 ==
* 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json
* 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
* 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json
* 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json
* 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling 
after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 
marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl 
commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to 
https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - 
marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet
* 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render
* 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json
* 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json
* 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet
* 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet
* 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
* 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

qsuuoun8w2b15yq7pw0dsic6cjjkham 2309724 2309723 2025-06-09T11:01:34Z Stashbot 7414 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance 2309724 wikitext text/x-wiki

== 2025-06-09 ==
* 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance
* 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json
* 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
* 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json
* 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json
* 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in
* 10:34
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based 
load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl 
commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to 
https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - 
marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s)
* 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0
* 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots
* 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s)
* 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync
* 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json
* 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]]
* 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 Lucas_WMDE: UTC afternoon backport+config window done
* 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json
* 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
* 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s)
* 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json
* 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json
* 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync
* 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]]
* 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json
* 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json
* 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
* 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json
* 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json
* 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
* 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json
* 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json
* 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
* 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json
* 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
* 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json
* 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json
* 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance
* 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> lxft284u76exek9t5abwvpvtn1f1ygp 2309725 2309724 2025-06-09T11:01:42Z Stashbot 7414 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 (T396130)', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json 2309725 wikitext text/x-wiki == 2025-06-09 == * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 
and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved 
to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: 
Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of 
db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved 
to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED
* 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s)
* 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0
* 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots
* 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s)
* 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync
* 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json
* 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]]
* 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 Lucas_WMDE: UTC afternoon backport+config window done
* 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json
* 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
* 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s)
* 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json
* 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json
* 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync
* 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]]
* 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json
* 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json
* 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
* 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json
* 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json
* 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
* 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json
* 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json
* 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
* 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json
* 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
* 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json
* 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json
* 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance
* 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook 
sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 
05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host 
db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) 
- Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - 
Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: 
helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json == Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> 98tj1tau72p9iris9lmejc79grhpo6c 2309727 2309726 2025-06-09T11:08:09Z Stashbot 7414 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 (T396130)', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json 2309727 wikitext text/x-wiki == 2025-06-09 == * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config 
saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 
db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl 
commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 
gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning
* 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]]
* 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync'
command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s)
* 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync
* 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]]
* 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s)
* 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0
* 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots
* 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s)
* 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync
* 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json
* 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]]
* 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 Lucas_WMDE: UTC afternoon backport+config window done
* 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json
* 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
* 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s)
* 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json
* 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json
* 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync
* 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]]
* 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json
* 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json
* 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
* 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json
* 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json
* 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
* 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json
* 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json
* 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
* 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json
* 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
* 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json
* 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json
* 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance
* 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json
* 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet
* 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet
* 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet
* 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json
* 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply
* 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json
* 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet
* 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json
* 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json
* 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s)
* 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet
* 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json
* 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json
* 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance
* 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json
* 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s)
* 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser
* 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json
* 10:03 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet
* 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync
* 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet
* 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet
* 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet
* 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet
* 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]]
* 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json
* 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance
* 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply
* 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from 
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]]
* 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json
* 05:27 marostegui@dns1006: END - running authdns-update
* 05:26 marostegui@dns1006: START - running authdns-update
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json
* 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]]
* 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json
* 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s)
* 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]]
* 05:14 marostegui@deploy1003: marostegui: Continuing with sync
* 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> 2025-06-09T11:18:41Z Stashbot == 2025-06-09 == * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config
saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 
db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl 
commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 
gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' 
command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json
* 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet
* 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json
* 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json
* 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 13:16 moritzm: installing libavif security updates
* 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
* 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet
* 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}}
* 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
* 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
* 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json
* 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply
* 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json
* 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json
* 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
* 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json
* 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
* 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
* 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
* 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
* 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json
* 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]]
* 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json
* 05:27 marostegui@dns1006: END - running authdns-update
* 05:26 marostegui@dns1006: START - running authdns-update
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json
* 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]]
* 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json
* 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s)
* 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]]
* 05:14 marostegui@deploy1003: marostegui: Continuing with sync
* 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet
* 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render
* 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json
* 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json
* 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet
* 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet
* 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
* 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
* 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude>
== 2025-06-09 == * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 
fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 
10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before 
switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 
'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: 
Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) 
for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s)
* 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync
* 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]]
* 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s)
* 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0
* 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots
* 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s)
* 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync
* 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json
* 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]]
* 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244
* 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002"
* 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox
* 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json
* 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json
* 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance
* 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json
* 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json
* 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet
* 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet
* 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}}
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json
* 14:50 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json
* 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json
* 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance
* 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json
* 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors
* 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s)
* 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran
* 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007
* 14:18 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host sretest2007
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002"
* 14:17 tgr: deploying a PrivateSettings config change
* 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json
* 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json
* 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json
* 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json
* 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]]
* 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
* 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
* 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
* 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
* 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
* 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
* 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]]
* 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
* 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
* 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
* 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json
* 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json
* 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json
* 13:40 moritzm: installing net-tools bugfix updates for bookworm
* 13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]]
* 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
* 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
* 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json
* 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance
* 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json
* 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json
* 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json
* 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 13:21 Lucas_WMDE: UTC afternoon backport+config window done
* 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json
* 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
* 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s)
* 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json
* 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json
* 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to /var/cache/conftool/dbconfig/20250605-131046-root.json
* 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync
* 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]]
* 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json
* 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002"
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json
* 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json
* 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
* 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
* 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance
* 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json
* 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json
* 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
* 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json
* 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json
* 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json
* 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance
* 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json
* 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply
* 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply
* 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json
* 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json
* 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json
* 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance
* 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json
* 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json
* 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json
* 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance
* 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json
* 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json
* 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet
* 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to /var/cache/conftool/dbconfig/20250605-112518-root.json
* 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json
* 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet
* 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json
* 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json
* 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet
* 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
* 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
* 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert
* 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
* 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json
* 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json
* 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json
* 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance
* 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json
* 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
* 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
* 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json
* 10:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json
* 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet
* 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json
* 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet
* 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json
* 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
* 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet
* 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet
* 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply
* 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]]
* 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]]
* 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json
* 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json == Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> 5v6kd8g9qxcm6x9c3dngsiup47dwa9w 2309730 2309729 2025-06-09T11:19:45Z Stashbot 7414 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance 2309730 wikitext text/x-wiki == 2025-06-09 == * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to
https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - 
Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config 
saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ 
using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from {{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}}
* 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s)
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet
* 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10
* 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync
* 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]]
* 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json
* 05:27 marostegui@dns1006: END - running authdns-update
* 05:26 marostegui@dns1006: START - running authdns-update
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json
* 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]]
* 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json
* 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s)
* 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]]
* 05:14 marostegui@deploy1003: marostegui: Continuing with sync
* 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> 7cx3hjgrkebcv7aoda05anbgotpydl5 2309731 2309730 2025-06-09T11:19:51Z Stashbot 7414 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in 2309731 wikitext text/x-wiki == 2025-06-09 == * 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 
'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 
db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous 
config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved 
to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 
marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ 
using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json * 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile 
[staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto 
db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: 
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - 
Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 
14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet)
* 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet
* 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet
* 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet
* 21:04 cjming: end of UTC late backport window
* 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet
* 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d
* 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync
* 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 20:54 cjming@deploy1003: matmarex, cjming: Backport for
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to
* 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]]
* 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet
* 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet
* 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet
* 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet
* 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s)
* 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet
* 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync
* 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet
*
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet
* 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]]
* 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet
* 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet
* 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s)
* 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet
* 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet
* 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync
* 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]]
* 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet
* 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet
* 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet
* 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet
* 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 19:13 sukhe@dns1004: END - running authdns-update
* 19:12 sukhe@dns1004: START - running authdns-update
* 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot]
* 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot]
* 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org
* 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org
* 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org
* 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
* 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply
* 19:10
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s)
* 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
* 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
* 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
* 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply
* 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s)
* 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]]
* 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts
an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox
* 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
* 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
* 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 16:01 fceratto@cumin1002: dbctl commit (dc=all):
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
* 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
* 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
* 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
* 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
* 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
* 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
* 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s)
* 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
* 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet
* 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 14:55 cmooney@dns2005: END - running authdns-update
* 14:54 cmooney@dns2005: START - running authdns-update
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:37 cgoubert@deploy1003: helmfile [eqiad] START
helmfile.d/services/mw-cron: apply
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json
* 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s)
* 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
* 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s)
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
* 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]]
* 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json
* 05:27 marostegui@dns1006: END - running authdns-update
* 05:26 marostegui@dns1006: START - running authdns-update
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json
* 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]]
* 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json
* 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s)
* 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]]
* 05:14 marostegui@deploy1003: marostegui: Continuing with sync
* 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 11:23 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77253 and previous config saved to /var/cache/conftool/dbconfig/20250609-112314-marostegui.json * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json * 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) 
db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 
10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl 
commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to 
https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - 
marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]]
* 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet
* 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet
* 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet
* 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet
* 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet
* 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet
* 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet
* 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet
* 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet
* 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet
* 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet
* 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet)
* 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet
* 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet
* 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet
* 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet
* 21:04 cjming: end of UTC late backport window
* 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet
* 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d
* 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync
* 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 20:54 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to
* 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]]
* 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet
* 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet
* 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet
* 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet
* 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s)
* 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet
* 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync
* 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet
* 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet
* 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]]
* 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet
* 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet
* 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s)
* 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet
* 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet
* 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync
* 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]]
* 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet
* 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet
* 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet
* 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet
* 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 19:13 sukhe@dns1004: END - running authdns-update
* 19:12 sukhe@dns1004: START - running authdns-update
* 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot]
* 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot]
* 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org
* 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org
* 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org
* 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply
* 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply
* 19:10 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s)
* 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply
* 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply
* 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply
* 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply
* 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s)
* 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]]
* 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox
* 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
* 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
* 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
* 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
* 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
* 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
* 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
* 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
* 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
* 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s)
* 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
* 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet
* 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 14:55 cmooney@dns2005: END - running authdns-update
* 14:54 cmooney@dns2005: START - running authdns-update
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:37 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json
* 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s)
* 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
* 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s)
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
* 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json
* 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s)
* 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}}
* 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye
* 13:33 samtar@deploy1003: samtar: Continuing with sync
* 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json
* 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]]
* 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json
* 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s)
* 05:49 marostegui@deploy1003: marostegui: Continuing with sync
* 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]]
* 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]]
* 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox
* 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}}
* 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 21:53 tzatziki: removing 4 files for legal compliance
* 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 21:41 tzatziki: removing 2 files for legal compliance
* 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]]
* 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s)
* 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 16:35 bvibber@deploy1003: bvibber: Continuing with sync
* 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:34 sukhe@dns1004: END - running authdns-update
* 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run
* 16:33 sukhe@dns1004: START - running authdns-update
* 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]]
* 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s)
* 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs
* 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json == Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> 0xxfb6itqhf6zomo45d8rs8dpi2hlwm 2309734 2309733 2025-06-09T11:31:18Z Stashbot 7414 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190 (T395241)', diff saved to https://phabricator.wikimedia.org/P77254 and previous config saved to /var/cache/conftool/dbconfig/20250609-113113-fceratto.json 2309734 wikitext text/x-wiki == 2025-06-09 == * 11:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77254 and previous config saved to /var/cache/conftool/dbconfig/20250609-113113-fceratto.json * 11:23 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77253 and previous config saved to /var/cache/conftool/dbconfig/20250609-112314-marostegui.json * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json * 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to 
https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: 
Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of 
db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved 
to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 
vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 
jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> 
and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot 
rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. 
* 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw == 2025-06-05 == * 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from {{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}}
* 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s)
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet
* 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet
* 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10
* 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync
* 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 11:38 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77255 and previous config saved to /var/cache/conftool/dbconfig/20250609-113821-marostegui.json
* 11:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77254 and previous config saved to /var/cache/conftool/dbconfig/20250609-113113-fceratto.json
* 11:23 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77253 and previous config saved to /var/cache/conftool/dbconfig/20250609-112314-marostegui.json
* 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json
* 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling 
in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to 
https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and 
A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to 
https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - 
Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate 
netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host 
db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) 
- Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - 
Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: 
helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199
* 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 08:22 volans@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot 
policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading 
P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 11:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P77256 and previous config saved to /var/cache/conftool/dbconfig/20250609-114622-fceratto.json * 11:38 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77255 and previous config saved to /var/cache/conftool/dbconfig/20250609-113821-marostegui.json * 11:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77254 and previous config saved to /var/cache/conftool/dbconfig/20250609-113113-fceratto.json * 11:23 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77253 and previous config saved to /var/cache/conftool/dbconfig/20250609-112314-marostegui.json * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1190 ([[phab:T395241|T395241]])', diff saved to
https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json * 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE 
helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) 
config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to 
https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END 
(PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>
== 2025-06-09 ==
* 11:53 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P77256 and previous config saved to /var/cache/conftool/dbconfig/20250609-114622-fceratto.json
* 11:38 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77255 and previous config saved to /var/cache/conftool/dbconfig/20250609-113821-marostegui.json
* 11:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77254 and previous config saved to /var/cache/conftool/dbconfig/20250609-113113-fceratto.json
* 11:23 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77253 and previous config saved to /var/cache/conftool/dbconfig/20250609-112314-marostegui.json
* 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1190 ([[phab:T395241|T395241]])', diff saved to
https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json * 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE 
helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to /var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) 
config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to 
https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to /var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END 
(PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]]
* 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json
* 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 
jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 
eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on 
P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: 
cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START 
- Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
* 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet * 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet * 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]] * 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1 * 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru * 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru * 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]] * 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw * 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway
* 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json
* 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance
* 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json
* 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet
* 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet
* 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet
* 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003"
* 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet
* 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json
* 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json
* 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet
* 09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet
* 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet
* 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json
* 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance
* 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json
* 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet
* 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet
* 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json
* 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet
* 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance
* 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json
* 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet
* 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet
* 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json
* 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet
* 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet
* 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json
* 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
* 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
* 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage
* 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
* 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json
* 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
* 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
* 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
* 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
* 07:38 gkyziridis@deploy1003: Sync cancelled.
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm
* 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
* 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
* 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003"
* 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
* 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs.
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}}
== 2025-06-03 ==
* 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage == 2025-06-02 == * 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002"
* 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync
* 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json
* 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json ==Archives == See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude> == 2025-06-09 == * 11:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77257 and previous config saved to /var/cache/conftool/dbconfig/20250609-115328-marostegui.json * 11:53 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 11:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P77256 and previous config saved to /var/cache/conftool/dbconfig/20250609-114622-fceratto.json * 11:38 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77255 and previous config saved to /var/cache/conftool/dbconfig/20250609-113821-marostegui.json * 11:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77254 and previous config saved to /var/cache/conftool/dbconfig/20250609-113113-fceratto.json * 11:23 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77253 and previous config saved to /var/cache/conftool/dbconfig/20250609-112314-marostegui.json * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json * 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - 
Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - 
marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet == 2025-06-08 == * 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]] * 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) == 2025-06-07 == * 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye * 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye * 08:12 elukey: restart apache2 / php-fpm on phab1004 * 04:18 mutante: restarted apache on phab1004 == 2025-06-06 == * 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom * 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s)
* 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]]
* 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox
* 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
* 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
* 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
* 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
* 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
* 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
* 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
* 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
* 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
* 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s)
* 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
* 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet
* 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 14:55 cmooney@dns2005: END - running authdns-update
* 14:54 cmooney@dns2005: START - running authdns-update
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:37 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json
* 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s)
* 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
* 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s)
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
* 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json
* 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s)
* 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}}
* 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye
* 13:33 samtar@deploy1003: samtar: Continuing with sync
* 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}}
* 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]]
* 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}}
* 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json
* 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json
* 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]]
* 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"'
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json
* 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json
* 13:14 jforrester@deploy1003: Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s)
* 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json
* 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
* 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
* 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync
* 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia
* 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]]
* 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json
* 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply
* 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
* 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json
* 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json
* 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet
* 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet
* 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
* 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
* 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:36 moritzm: installing modsecurity-apache security updates
* 12:36 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json
* 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json
* {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}}
* 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json
* 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json
* 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json
* 12:23 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json
* 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance
* 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 12:21 reedy@deploy1003: reedy: Continuing with sync
* 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
* {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}}
* 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}}
* 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json
* 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage
* 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json
* 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json
* 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s)
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json
* 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
* 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm
* 11:51 samtar@deploy1003: samtar: Continuing with sync
* 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]]
* 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json
* 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json
* 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox
* 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply
* 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply
* 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalence on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A/ Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json
* 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet
* 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json
* 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm
* 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json
* 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s)
* 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json
* 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync
* 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

==Archives ==
See [[Server Admin Log/Archives]].
<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

qkk12pipy42jap8icuu28c96baco26m 2309739 2309738 2025-06-09T11:53:44Z Stashbot 7414 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance 2309739 wikitext text/x-wiki

== 2025-06-09 ==
* 11:53 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance
* 11:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77257 and previous config saved to /var/cache/conftool/dbconfig/20250609-115328-marostegui.json
* 11:53 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P77256 and previous config saved to /var/cache/conftool/dbconfig/20250609-114622-fceratto.json * 11:38 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77255 and previous config saved to /var/cache/conftool/dbconfig/20250609-113821-marostegui.json * 11:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77254 and previous config saved to /var/cache/conftool/dbconfig/20250609-113113-fceratto.json * 11:23 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77253 and previous config saved to /var/cache/conftool/dbconfig/20250609-112314-marostegui.json * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json * 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - 
Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - 
marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet
== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004
== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw
== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s) * 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync * 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] * 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host 
an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP 
reboot policy FORCED * 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply * 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply * 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply * 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply * 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 17:15 vgutierrez@cumin1002: END (PASS) - 
Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet * 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s) * 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 * 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots * 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s) * 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync * 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json
* 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json
* 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]]
* 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance
* 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json
* 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance
* 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json

== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir
* 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir
* 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet
* 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet
* 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]]
* 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org
* 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.*
* 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org
* 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.*
* 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]])
* 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org
* 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox)
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply
* 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}}
* 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]]
* 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s)
* 17:56 bvibber@deploy1003: bvibber: Continuing with sync
* 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]]
* 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync
* 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync
* 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
* 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s)
* 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]]
* 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002"
* 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet
* 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet
* 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox
* 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet
* 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum
* 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough
* 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet
* 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet
* 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet
* 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json
* 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json
* 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
* 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json
* 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
* 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
* 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s)
* 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync
* 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json
* 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum
* 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]]
* 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough
* 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json
* 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance
* 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json
* 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s)
* 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
* 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet
* 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet
* 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts
* 14:55 cmooney@dns2005: END - running authdns-update
* 14:54 cmooney@dns2005: START - running authdns-update
* 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002"
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye
* 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
* 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
* 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye
* 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
* 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:37 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json
* 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s)
* 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
* 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s)
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
* 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
* 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]]
* 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'.
* 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'.
* 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
* 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
* 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
* 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
* 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'.
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002"
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet
* 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002"
* 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json
* 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance
* 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json
* 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox
* 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet
* 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]]
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org
* 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
* 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply
* 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet
* 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
* 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
* 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply
* 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json
* 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage
* 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage
* 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]]
* 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
* 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply
* 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]]
* 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet
* 13:46 sukhe: forcing ats-backend-restart on cp1104
* 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json
* 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json
* 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s)
* 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage
* 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet
* 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet
* 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}}
* 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye
* 13:33 samtar@deploy1003: samtar: Continuing with sync
* 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}}
* 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]]
* 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}}
* 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json
* 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json
* 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]]
* 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"'
* 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json
* 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
* 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json
* 13:14 jforrester@deploy1003: Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s)
* 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json
* 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
* 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
* 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync
* 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia
* 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]]
* 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json
* 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B
* 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply
* 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply
* 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply
* 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet
* 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json
* 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json
* 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet
* 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet
* 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
* 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm
* 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
* 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:36 moritzm: installing modsecurity-apache security updates
* 12:36 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002"
* 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye
* 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json
* 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json
* {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}}
* 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json
* 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors
* 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
* 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002"
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json
* 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
* 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json
* 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE 
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json
* 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet
* 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99)
* 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json
* 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
* 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json
* 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet
* 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured
* 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json
* 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json
* 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views
* 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm
* 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm
* 08:22 moritzm: rearm keyholder on cumin1003 following reboot
* 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json
* 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet
* 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye
* 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002"
* 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json
* 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab
* 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json
* 07:56
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json
* 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance
* 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s)
* 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s)
* 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage
* 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage
* 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye
* 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye
* 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
* 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet
* 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance
* 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json

== Archives ==
See [[Server Admin Log/Archives]].

<noinclude>
[[Category:SAL]]
[[Category:Operations]]
</noinclude>

== 2025-06-09 ==
* 11:53 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2224 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77258 and previous config saved to /var/cache/conftool/dbconfig/20250609-115350-marostegui.json
* 11:53 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2224.codfw.wmnet with reason: Maintenance
* 11:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77257 and previous config saved to /var/cache/conftool/dbconfig/20250609-115328-marostegui.json
* 11:53 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 11:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P77256 and previous config saved to /var/cache/conftool/dbconfig/20250609-114622-fceratto.json * 11:38 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77255 and previous config saved to /var/cache/conftool/dbconfig/20250609-113821-marostegui.json * 11:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77254 and previous config saved to /var/cache/conftool/dbconfig/20250609-113113-fceratto.json * 11:23 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P77253 and previous config saved to /var/cache/conftool/dbconfig/20250609-112314-marostegui.json * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1190 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77252 and previous config saved to /var/cache/conftool/dbconfig/20250609-111951-fceratto.json * 11:19 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2151* gradually with 4 steps - Pooling in * 11:19 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1190.eqiad.wmnet with reason: Maintenance * 11:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77250 and previous config saved to /var/cache/conftool/dbconfig/20250609-111926-fceratto.json * 11:18 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2158* gradually with 4 steps - Pooling in * 11:08 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2217 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77247 and previous config saved to /var/cache/conftool/dbconfig/20250609-110807-marostegui.json * 11:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77245 and previous config saved to /var/cache/conftool/dbconfig/20250609-110418-fceratto.json * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2217 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77243 and previous config saved to /var/cache/conftool/dbconfig/20250609-110140-marostegui.json * 11:01 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2217.codfw.wmnet with reason: Maintenance * 11:01 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77242 and previous config saved to /var/cache/conftool/dbconfig/20250609-110118-marostegui.json * 10:54 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:54 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P77240 and previous config saved to /var/cache/conftool/dbconfig/20250609-104911-fceratto.json * 10:46 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77238 and previous config saved to /var/cache/conftool/dbconfig/20250609-104611-marostegui.json * 10:34 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2151* gradually with 4 steps - Pooling in * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77236 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-103404-fceratto.json * 10:33 hnowlan@deploy1003: helmfile [staging] DONE helmfile.d/services/rest-gateway: apply * 10:33 hnowlan@deploy1003: helmfile [staging] START helmfile.d/services/rest-gateway: apply * 10:33 fceratto@cumin1002: START - Cookbook sre.mysql.pool db2158* gradually with 4 steps - Pooling in * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2151.codfw.wmnet * 10:31 fceratto@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db2158.codfw.wmnet * 10:31 fceratto@cumin1002: START - Cookbook sre.hosts.remove-downtime for db2158.codfw.wmnet * 10:31 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P77234 and previous config saved to /var/cache/conftool/dbconfig/20250609-103104-marostegui.json * 10:30 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:29 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 10:22 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1160 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77233 and previous config saved to /var/cache/conftool/dbconfig/20250609-102214-fceratto.json * 10:22 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Maintenance * 10:16 vgutierrez: repooling lvs1013 handling ncredir@eqiad using katran based load balancing - [[phab:T395228|T395228]] * 10:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2214 ([[phab:T396130|T396130]])', diff saved to 
https://phabricator.wikimedia.org/P77232 and previous config saved to /var/cache/conftool/dbconfig/20250609-101557-marostegui.json * 10:14 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214', diff saved to https://phabricator.wikimedia.org/P77231 and previous config saved to /var/cache/conftool/dbconfig/20250609-101400-marostegui.json * 10:12 marostegui@cumin1002: dbctl commit (dc=all): 'db2214 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77230 and previous config saved to /var/cache/conftool/dbconfig/20250609-101258-root.json * 10:12 fceratto@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of db2158.codfw.wmnet onto db2151.codfw.wmnet * 10:08 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Maintenance * 10:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2214 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77229 and previous config saved to /var/cache/conftool/dbconfig/20250609-100605-marostegui.json * 10:03 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2197.codfw.wmnet with reason: Maintenance * 10:03 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77228 and previous config saved to /var/cache/conftool/dbconfig/20250609-100337-marostegui.json * 09:48 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77227 and previous config saved to /var/cache/conftool/dbconfig/20250609-094830-marostegui.json * 09:42 marostegui: Migrate s2 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 09:35 vgutierrez@cumin1003: END (PASS) - 
Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:34 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 09:33 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P77226 and previous config saved to /var/cache/conftool/dbconfig/20250609-093323-marostegui.json * 09:31 vgutierrez: depooling lvs1013 before switching ncredir@eqiad to katran based load balancing - [[phab:T395228|T395228]] * 09:28 tappof@dns1004: END - running authdns-update * 09:27 tappof@dns1004: START - running authdns-update * 09:20 vgutierrez: upload liberica 0.17 to apt.wm.o (bookworm-wikimedia) - [[phab:T395228|T395228]] * 09:18 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77225 and previous config saved to /var/cache/conftool/dbconfig/20250609-091816-marostegui.json * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2193 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77224 and previous config saved to /var/cache/conftool/dbconfig/20250609-091528-marostegui.json * 09:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2193.codfw.wmnet with reason: Maintenance * 09:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77223 and previous config saved to /var/cache/conftool/dbconfig/20250609-091506-marostegui.json * 09:00 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77222 and previous config saved to 
/var/cache/conftool/dbconfig/20250609-085959-marostegui.json * 08:44 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P77221 and previous config saved to /var/cache/conftool/dbconfig/20250609-084452-marostegui.json * 08:29 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77220 and previous config saved to /var/cache/conftool/dbconfig/20250609-082945-marostegui.json * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2180 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77219 and previous config saved to /var/cache/conftool/dbconfig/20250609-082655-marostegui.json * 08:26 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2180.codfw.wmnet with reason: Maintenance * 08:26 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77218 and previous config saved to /var/cache/conftool/dbconfig/20250609-082633-marostegui.json * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77217 and previous config saved to /var/cache/conftool/dbconfig/20250609-081126-marostegui.json * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2243.codfw.wmnet onto db2244.codfw.wmnet * 08:08 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 07:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P77215 and previous config saved to /var/cache/conftool/dbconfig/20250609-075619-marostegui.json * 07:41 marostegui@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77213 and previous config saved to /var/cache/conftool/dbconfig/20250609-074112-marostegui.json * 07:34 marostegui@cumin1002: dbctl commit (dc=all): 'Depooling db2169 ([[phab:T396130|T396130]])', diff saved to https://phabricator.wikimedia.org/P77211 and previous config saved to /var/cache/conftool/dbconfig/20250609-073403-marostegui.json * 07:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2169.codfw.wmnet with reason: Maintenance * 07:28 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2158.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2151.codfw.wmnet with reason: Maintenance * 07:23 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2244 gradually with 4 steps - Pool db2244.codfw.wmnet in after cloning * 06:22 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:42 marostegui: Add MariaDB 10.11.13 to the repo [[phab:T395663|T395663]] * 05:37 marostegui@cumin1002: START - Cookbook sre.mysql.pool db2243 gradually with 4 steps - Pool db2243.codfw.wmnet in after cloning * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Add db2244 to dbctl depooled [[phab:T393989|T393989]]', diff saved to https://phabricator.wikimedia.org/P77205 and previous config saved to /var/cache/conftool/dbconfig/20250609-052451-marostegui.json * 05:00 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - marostegui@cumin1002 * 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.depool db2243 - Depool db2243.codfw.wmnet to then clone it to db2244.codfw.wmnet - 
marostegui@cumin1002
* 05:00 marostegui@cumin1002: START - Cookbook sre.mysql.clone of db2243.codfw.wmnet onto db2244.codfw.wmnet

== 2025-06-08 ==
* 12:04 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396290|T396290]] and [[phab:T396291|T396291]]
* 09:53 ryankemper@cumin2002: END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)

== 2025-06-07 ==
* 11:43 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2007.codfw.wmnet with OS bullseye
* 11:07 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2006.codfw.wmnet with OS bullseye
* 08:12 elukey: restart apache2 / php-fpm on phab1004
* 04:18 mutante: restarted apache on phab1004

== 2025-06-06 ==
* 21:33 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:25 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:19 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:15 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:02 bking@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: downtime before decom
* 20:42 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage
* 20:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 20:38 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2007.codfw.wmnet with reason: host reimage * 20:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2007.codfw.wmnet with OS bullseye * 20:06 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 20:03 jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2006.codfw.wmnet with reason: host reimage * 19:45 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 19:31 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2006.codfw.wmnet with OS bullseye * 19:11 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2244.codfw.wmnet with OS bookworm * 19:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 18:25 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 18:06 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" * 17:49 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:46 
jhancock@cumin2002: START - Cookbook sre.hosts.downtime for 2:00:00 on db2244.codfw.wmnet with reason: host reimage * 17:29 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host db2244.codfw.wmnet with OS bookworm * 17:29 jhancock@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2244'] * 17:29 jhancock@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2244'] * 17:20 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:10 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 17:08 bking@cumin2002: END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:06 bking@cumin2002: START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: [[phab:T383811|T383811]] - bking@cumin2002 * 17:00 sukhe: forced agent run on O:alerting_host to reload vopsbot to add cdobbins * 16:57 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:57 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:56 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:55 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host db2244 * 16:55 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:42 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 16:08 sbassett: Deployed security update to fix [[phab:T396111|T396111]] * 15:41 eevans@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet * 15:34 eevans@cumin1002: START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet * 15:24 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:23 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 15:19 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 15:19 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 14:53 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:42 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:24 sukhe@dns1004: END - running authdns-update * 14:23 sukhe@dns1004: START - running authdns-update * 14:23 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 14:23 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2004.wikimedia.org * 14:22 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:22 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1005.wikimedia.org * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of 
dns1005.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1005*<nowiki>}</nowiki> and (A:dnsbox) * 14:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2004.wikimedia.org * 14:10 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2004*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:51 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns2006.wikimedia.org * 13:49 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:49 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns1006.wikimedia.org * 13:40 vgutierrez@cumin1003: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:40 vgutierrez@cumin1003: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 13:35 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns2006.wikimedia.org * 13:35 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns2006*<nowiki>}</nowiki> and (A:dnsbox) * 13:34 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns1006.wikimedia.org * 13:34 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns1006*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on 
P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:32 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns3003.wikimedia.org * 13:31 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:31 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns6001.wikimedia.org * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:21 cmooney@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:21 cmooney@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add back entry for mistakenly deleted ssw1-a8-codfw IP - cmooney@cumin1003" * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns6001.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns6001*<nowiki>}</nowiki> and (A:dnsbox) * 13:18 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns3003.wikimedia.org * 13:18 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns3003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 13:13 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns5003.wikimedia.org * 13:09 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 13:09 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns4003.wikimedia.org * 13:06 cmooney@cumin1003: START - Cookbook sre.dns.netbox * 12:58 
sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns5003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns5003*<nowiki>}</nowiki> and (A:dnsbox) * 12:58 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns4003.wikimedia.org * 12:58 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on P<nowiki>{</nowiki>dns4003*<nowiki>}</nowiki> and (A:dnsbox) * 12:20 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:20 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2158 - Depool db2158.codfw.wmnet to then clone it to db2151.codfw.wmnet - fceratto@cumin1002 * 12:19 fceratto@cumin1002: START - Cookbook sre.mysql.clone of db2158.codfw.wmnet onto db2151.codfw.wmnet * 12:03 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply * 12:03 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/rest-gateway: apply * 11:52 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db2151.codfw.wmnet with reason: Disabling notifications * 11:42 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply * 11:42 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/services/rest-gateway: apply * 11:32 jayme@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:32 jayme@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:31 jayme@deploy1003: helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. * 11:30 jayme@deploy1003: helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. 
* 11:28 jayme@deploy1003: helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:28 jayme@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. * 11:27 jayme@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. * 11:26 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:25 jayme@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. * 11:22 jayme@deploy1003: helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. * 11:21 jayme@deploy1003: helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:18 jayme@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:17 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:12 jayme@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 09:40 fceratto@cumin1002: END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2151* - Log issue and disk filled up * 09:40 fceratto@cumin1002: START - Cookbook sre.mysql.depool db2151* - Log issue and disk filled up * 09:05 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 09:04 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 08:37 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398044 * 08:36 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 398044 * 08:35 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46997 * 08:34 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46997 * 08:33 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 394065 * 08:33 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 394065 * 08:32 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 46562 * 08:32 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 46562 * 08:30 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 13150 * 08:30 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 13150 * 08:28 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 199524 * 08:25 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 199524 * 08:24 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:24 ayounsi@cumin1003: END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63199 * 08:23 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:23 ayounsi@cumin1003: START - Cookbook sre.network.peering with action 'email' for AS: 63199 * 08:22 volans@cumin1003: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cirrussearch2113.codfw.wmnet * 08:22 volans@cumin1003: END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host cirrussearch2113.codfw.wmnet
* 08:11 volans@cumin1003: START - Cookbook sre.hosts.reboot-single for host cirrussearch2113.codfw.wmnet
* 08:10 volans@cumin1003: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cirrussearch2113.codfw.wmnet
* 07:52 ryankemper@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on cirrussearch2113.codfw.wmnet with reason: [[phab:T394543|T394543]]
* 06:15 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1157-1159].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:14 stevemunene@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1160-1162].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB group 5 - rack F1
* 06:13 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device asw1-b3-magru
* 06:13 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device asw1-b3-magru
* 05:42 XioNoX: push pfw policies - [[phab:T395904|T395904]]
* 03:02 eevans@cumin1002: END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw
* 02:57 eevans@cumin1002: START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw

== 2025-06-05 ==
* 22:05 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:04 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:03 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 22:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' .
* 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:58 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:57 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:42 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:29 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 21:00 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply * 20:59 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply * 20:57 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:56 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 20:24 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] (duration: 10m 05s)
* 20:17 jdlrobson@deploy1003: jdlrobson: Continuing with sync
* 20:16 jdlrobson@deploy1003: jdlrobson: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 20:14 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1154098{{!}}Fix back compat for data-chart (T395462)]]
* 20:09 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye
* 19:24 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:22 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:15 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:14 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:13 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 19:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 19:12 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 19:12 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:52 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab
* 18:49 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye
* 18:48 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:43 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:32 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:30 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:21 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group2 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]]
* 18:21 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:20 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186
* 18:20 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186
* 18:19 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:18 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply
* 18:17 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply
* 18:17 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet)
* 18:17 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:17 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
* 18:05 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
* 18:04 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
* 17:28 bd808@deploy1003: helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
* 17:27 bd808@deploy1003: helmfile [eqiad] START helmfile.d/services/developer-portal: apply
* 17:25 bd808@deploy1003: helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [codfw] START helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] DONE helmfile.d/services/developer-portal: apply
* 17:24 bd808@deploy1003: helmfile [staging] START helmfile.d/services/developer-portal: apply
* 17:21 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:21 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica
* 17:15 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet
* 17:15 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet
* 16:54 brett@cumin2002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:54 brett@cumin2002: START - Cookbook sre.hosts.remove-downtime for acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet
* 16:51 mfossati@deploy1003: Finished deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0 (duration: 01m 16s)
* 16:50 mfossati@deploy1003: Started deploy [airflow-dags/platform_eng@930d28b]: adapt check_bad_parsing to dumps 2.0
* 16:50 brett@cumin2002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on acmechief2002.codfw.wmnet,acmechief1002.eqiad.wmnet,acmechief-test2001.codfw.wmnet,acmechief-test1001.eqiad.wmnet with reason: Reboots
* 16:27 jdlrobson@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] (duration: 11m 23s)
* 16:20 jdlrobson@deploy1003: jdlrobson, jdrewniak: Continuing with sync
* 16:18 jdlrobson@deploy1003: jdlrobson, jdrewniak: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
* 16:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77191 and previous config saved to /var/cache/conftool/dbconfig/20250605-161701-fceratto.json * 16:16 jdlrobson@deploy1003: Started scap sync-world: Backport for [[gerrit:1153750{{!}}Revert "Deploy survey to en at twenty percent"]] * 16:12 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:03 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host db2244.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:02 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db2244 * 16:02 jhancock@cumin2002: START - Cookbook sre.network.configure-switch-interfaces for host db2244 * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', diff saved to https://phabricator.wikimedia.org/P77190 and previous config saved to /var/cache/conftool/dbconfig/20250605-160154-fceratto.json * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:01 jhancock@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 16:01 jhancock@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2244 to codfw - jhancock@cumin2002" * 15:57 jhancock@cumin2002: START - Cookbook sre.dns.netbox * 15:55 aokoth@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237', 
diff saved to https://phabricator.wikimedia.org/P77189 and previous config saved to /var/cache/conftool/dbconfig/20250605-154647-fceratto.json * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77188 and previous config saved to /var/cache/conftool/dbconfig/20250605-153139-fceratto.json * 15:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2237 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77187 and previous config saved to /var/cache/conftool/dbconfig/20250605-152314-fceratto.json * 15:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2237.codfw.wmnet with reason: Maintenance * 15:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77186 and previous config saved to /var/cache/conftool/dbconfig/20250605-152248-fceratto.json * 15:08 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2004-dev.codfw.wmnet * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77185 and previous config saved to /var/cache/conftool/dbconfig/20250605-150741-fceratto.json * 14:59 taavi@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudlb2004-dev.codfw.wmnet * 14:59 taavi@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudlb2003-dev.codfw.wmnet * 14:53 damilare: payments-wiki upgraded from {{Gerrit|2d8b655a}} to {{Gerrit|aa102260}} * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236', diff saved to https://phabricator.wikimedia.org/P77184 and previous config saved to /var/cache/conftool/dbconfig/20250605-145234-fceratto.json * 14:50 taavi@cumin1002: START - Cookbook 
sre.hosts.reboot-single for host cloudlb2003-dev.codfw.wmnet * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77183 and previous config saved to /var/cache/conftool/dbconfig/20250605-143724-fceratto.json * 14:29 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2236 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77182 and previous config saved to /var/cache/conftool/dbconfig/20250605-142908-fceratto.json * 14:29 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2236.codfw.wmnet with reason: Maintenance * 14:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77181 and previous config saved to /var/cache/conftool/dbconfig/20250605-142840-fceratto.json * 14:20 taavi@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:20 taavi@cumin1002: START - Cookbook sre.dns.wipe-cache 'private.codfw.wikimedia.cloud$' on codfw recursors * 14:19 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395468|T395468]] (duration: 39m 39s) * 14:19 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 14:18 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:18 jhancock@cumin2002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host sretest2007 * 14:18 jhancock@cumin2002: START - Cookbook 
sre.network.configure-switch-interfaces for host sretest2007 * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:17 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: assign cloud-private v6 addresses for codfw1dev devices - taavi@cumin1002" * 14:17 tgr: deploying a PrivateSettings config change * 14:13 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77179 and previous config saved to /var/cache/conftool/dbconfig/20250605-141333-fceratto.json * 14:13 taavi@cumin1002: START - Cookbook sre.dns.netbox * 14:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77178 and previous config saved to /var/cache/conftool/dbconfig/20250605-141108-root.json * 13:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219', diff saved to https://phabricator.wikimedia.org/P77177 and previous config saved to /var/cache/conftool/dbconfig/20250605-135826-fceratto.json * 13:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77176 and previous config saved to /var/cache/conftool/dbconfig/20250605-135603-root.json * 13:51 marostegui: Migrate s2 codfw to SBR dbmaint [[phab:T383795|T383795]] * 13:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply * 13:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE 
helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet * 13:49 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply * 13:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply * 13:49 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply * 13:48 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply * 13:48 vgutierrez: upload liberica 0.16 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 13:48 cgoubert@deploy1003: helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply * 13:47 cgoubert@deploy1003: helmfile [staging] START helmfile.d/services/shellbox-constraints: apply * 13:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet * 13:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77175 and previous config saved to /var/cache/conftool/dbconfig/20250605-134319-fceratto.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77174 and previous config saved to /var/cache/conftool/dbconfig/20250605-134158-root.json * 13:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77173 and previous config saved to /var/cache/conftool/dbconfig/20250605-134153-root.json * 13:40 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77172 and previous config saved to /var/cache/conftool/dbconfig/20250605-134057-root.json * 13:40 moritzm: installing net-tools bugfix updates for bookworm * 
13:40 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395468|T395468]] * 13:38 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet * 13:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet * 13:35 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2219 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77171 and previous config saved to /var/cache/conftool/dbconfig/20250605-133500-fceratto.json * 13:34 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2219.codfw.wmnet with reason: Maintenance * 13:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77170 and previous config saved to /var/cache/conftool/dbconfig/20250605-133434-fceratto.json * 13:31 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77169 and previous config saved to /var/cache/conftool/dbconfig/20250605-132652-root.json * 13:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77168 and previous config saved to /var/cache/conftool/dbconfig/20250605-132648-root.json * 13:25 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77167 and previous config saved to /var/cache/conftool/dbconfig/20250605-132552-root.json * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts netflow7001.magru.wmnet * 
13:23 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:23 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: netflow7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 13:21 Lucas_WMDE: UTC afternoon backport+config window done * 13:19 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77166 and previous config saved to /var/cache/conftool/dbconfig/20250605-131926-fceratto.json * 13:18 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet * 13:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 13:16 gkyziridis@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] (duration: 11m 51s) * 13:12 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts netflow7001.magru.wmnet * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77165 and previous config saved to /var/cache/conftool/dbconfig/20250605-131147-root.json * 13:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77164 and previous config saved to /var/cache/conftool/dbconfig/20250605-131142-root.json * 13:10 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77163 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-131046-root.json * 13:09 gkyziridis@deploy1003: gkyziridis: Continuing with sync * 13:07 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 13:04 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1153945{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (excluding azwiki) (T395823)]] * 13:04 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210', diff saved to https://phabricator.wikimedia.org/P77162 and previous config saved to /var/cache/conftool/dbconfig/20250605-130419-fceratto.json * 13:03 taavi@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 13:03 taavi@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "remove outdated octavia net - taavi@cumin1002" * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77161 and previous config saved to /var/cache/conftool/dbconfig/20250605-125641-root.json * 12:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77160 and previous config saved to /var/cache/conftool/dbconfig/20250605-125637-root.json * 12:55 marostegui@cumin1002: dbctl commit (dc=all): 'db2151 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77159 and previous config saved to /var/cache/conftool/dbconfig/20250605-125540-root.json * 12:54 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet * 12:54 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet * 12:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2151.codfw.wmnet with reason: Maintenance * 12:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2151 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77158 and previous config saved to /var/cache/conftool/dbconfig/20250605-125057-marostegui.json * 12:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77157 and previous config saved to /var/cache/conftool/dbconfig/20250605-124912-fceratto.json * 12:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet * 12:43 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77156 and previous config saved to /var/cache/conftool/dbconfig/20250605-124136-root.json * 12:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77155 and previous config saved to /var/cache/conftool/dbconfig/20250605-124131-root.json * 12:41 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2210 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77154 and previous config saved to /var/cache/conftool/dbconfig/20250605-124110-fceratto.json * 12:41 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance * 12:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77153 and previous 
config saved to /var/cache/conftool/dbconfig/20250605-124043-fceratto.json * 12:32 jakob@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply * 12:32 jakob@deploy1003: helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply * 12:31 jakob@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply * 12:30 jakob@deploy1003: helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2042 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77151 and previous config saved to /var/cache/conftool/dbconfig/20250605-122631-root.json * 12:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2045 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77150 and previous config saved to /var/cache/conftool/dbconfig/20250605-122625-root.json * 12:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77149 and previous config saved to /var/cache/conftool/dbconfig/20250605-122537-fceratto.json * 12:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2042,2045].codfw.wmnet with reason: Maintenance * 12:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2042 es2045 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77147 and previous config saved to /var/cache/conftool/dbconfig/20250605-122035-marostegui.json * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P77146 and previous config saved to /var/cache/conftool/dbconfig/20250605-121029-fceratto.json * 11:55 fceratto@cumin1002: 
dbctl commit (dc=all): 'Repooling after maintenance db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77145 and previous config saved to /var/cache/conftool/dbconfig/20250605-115522-fceratto.json * 11:49 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:48 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 11:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2206 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77144 and previous config saved to /var/cache/conftool/dbconfig/20250605-114711-fceratto.json * 11:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2199.codfw.wmnet with reason: Maintenance * 11:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77143 and previous config saved to /var/cache/conftool/dbconfig/20250605-114213-fceratto.json * 11:35 moritzm: installing Linux 5.10.237 on Bullseye hosts * 11:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77142 and previous config saved to /var/cache/conftool/dbconfig/20250605-112706-fceratto.json * 11:26 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2032.codfw.wmnet * 11:25 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2032.codfw.wmnet * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77141 and previous config saved to 
/var/cache/conftool/dbconfig/20250605-112518-root.json * 11:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77140 and previous config saved to /var/cache/conftool/dbconfig/20250605-112511-root.json * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2032.codfw.wmnet * 11:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P77139 and previous config saved to /var/cache/conftool/dbconfig/20250605-111158-fceratto.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77138 and previous config saved to /var/cache/conftool/dbconfig/20250605-111013-root.json * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77137 and previous config saved to /var/cache/conftool/dbconfig/20250605-111005-root.json * 11:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2032.codfw.wmnet * 11:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet * 11:03 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 11:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet * 11:02 gehel: restarting Blazegraph on wdqs1023 to address allocator decreasing alert * 10:57 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet * 10:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2179 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77136 and previous config saved to /var/cache/conftool/dbconfig/20250605-105650-fceratto.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77135 and previous config saved to /var/cache/conftool/dbconfig/20250605-105507-root.json * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77134 and previous config saved to /var/cache/conftool/dbconfig/20250605-105500-root.json * 10:52 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77133 and previous config saved to /var/cache/conftool/dbconfig/20250605-105216-root.json * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2179 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77132 and previous config saved to /var/cache/conftool/dbconfig/20250605-104928-fceratto.json * 10:49 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2179.codfw.wmnet with reason: Maintenance * 10:49 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77131 and previous config saved to /var/cache/conftool/dbconfig/20250605-104912-fceratto.json * 10:42 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet * 10:41 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77130 and previous config saved to /var/cache/conftool/dbconfig/20250605-104002-root.json * 10:39 
marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77129 and previous config saved to /var/cache/conftool/dbconfig/20250605-103954-root.json * 10:39 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1165.eqiad.wmnet * 10:37 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77128 and previous config saved to /var/cache/conftool/dbconfig/20250605-103711-root.json * 10:34 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet * 10:34 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77127 and previous config saved to /var/cache/conftool/dbconfig/20250605-103403-fceratto.json * 10:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet * 10:32 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1165.eqiad.wmnet * 10:31 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1164.eqiad.wmnet * 10:31 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-experimental: apply * 10:30 Ammar: Ran fixStuckGlobalRename.php for [[phab:T396054|T396054]] * 10:27 claime: Manual run of generatecaptcha on mw-cron with delete - [[phab:T388531|T388531]] * 10:26 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 10:26 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77126 and previous config saved to /var/cache/conftool/dbconfig/20250605-102456-root.json * 10:24 marostegui@cumin1002: dbctl commit (dc=all): 
'es2043 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77125 and previous config saved to /var/cache/conftool/dbconfig/20250605-102449-root.json * 10:24 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1164.eqiad.wmnet * 10:23 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2041 to es4 master and es2044 as es5 master', diff saved to https://phabricator.wikimedia.org/P77124 and previous config saved to /var/cache/conftool/dbconfig/20250605-102319-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2029.codfw.wmnet * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2029.codfw.wmnet * 10:22 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77123 and previous config saved to /var/cache/conftool/dbconfig/20250605-102205-root.json * 10:21 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-experimental: apply * 10:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P77122 and previous config saved to /var/cache/conftool/dbconfig/20250605-101856-fceratto.json * 10:16 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2029.codfw.wmnet * 10:16 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 10:09 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet * 10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2046 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77121 and previous config saved to /var/cache/conftool/dbconfig/20250605-100950-root.json * 
10:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2043 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77120 and previous config saved to /var/cache/conftool/dbconfig/20250605-100943-root.json * 10:08 cgoubert@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] (duration: 10m 36s) * 10:08 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2007.codfw.wmnet * 10:07 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77119 and previous config saved to /var/cache/conftool/dbconfig/20250605-100700-root.json * 10:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77118 and previous config saved to /var/cache/conftool/dbconfig/20250605-100527-root.json * 10:04 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es[2043,2046].codfw.wmnet with reason: Maintenance * 10:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2043 es2046 [[phab:T395241|T395241]]', diff saved to https://phabricator.wikimedia.org/P77117 and previous config saved to /var/cache/conftool/dbconfig/20250605-100419-marostegui.json * 10:04 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser (duration: 00m 16s) * 10:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@d11bd51]: Update webrequest-test hive jar for ua-parser * 10:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77116 and previous config saved to /var/cache/conftool/dbconfig/20250605-100349-fceratto.json * 10:03 jmm@cumin1003: START - Cookbook 
sre.ganeti.drain-node for draining ganeti node ganeti2029.codfw.wmnet * 10:01 cgoubert@deploy1003: gkyziridis, cgoubert: Continuing with sync * 10:00 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2007.codfw.wmnet * 10:00 cgoubert@deploy1003: gkyziridis, cgoubert: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet * 09:58 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1007.eqiad.wmnet * 09:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet * 09:57 cgoubert@deploy1003: Started scap sync-world: Backport for [[gerrit:1153937{{!}}Revert "ores-extension: enable extension with revertrisk filter for second batch of wikis"]] * 09:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77115 and previous config saved to /var/cache/conftool/dbconfig/20250605-095415-fceratto.json * 09:54 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2172.codfw.wmnet with reason: Maintenance * 09:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77114 and previous config saved to /var/cache/conftool/dbconfig/20250605-095347-fceratto.json * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:53 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich: apply * 09:52 
jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet * 09:51 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77113 and previous config saved to /var/cache/conftool/dbconfig/20250605-095155-root.json * 09:50 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1007.eqiad.wmnet * 09:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77112 and previous config saved to /var/cache/conftool/dbconfig/20250605-095022-root.json * 09:44 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet * 09:44 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:43 claime: Re-enabling CPU/RAM limits on mw-cron - [[phab:T395436|T395436]] * 09:43 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77111 and previous config saved to /var/cache/conftool/dbconfig/20250605-093840-fceratto.json * 09:36 marostegui@cumin1002: dbctl commit (dc=all): 'db2158 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77110 and previous config saved to /var/cache/conftool/dbconfig/20250605-093649-root.json * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet * 09:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet * 09:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:35 hnowlan: Migrate reading lists API out of restbase for group1 via rest-gateway * 09:35 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77109 and previous config saved to /var/cache/conftool/dbconfig/20250605-093515-root.json * 09:31 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2158.codfw.wmnet with reason: Maintenance * 09:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2158 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77108 and previous config saved to /var/cache/conftool/dbconfig/20250605-093107-marostegui.json * 09:30 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet * 09:26 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet * 09:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet * 09:24 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add netflow7002 - jmm@cumin1003" * 09:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet * 09:23 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155', diff saved to https://phabricator.wikimedia.org/P77107 and previous config saved to /var/cache/conftool/dbconfig/20250605-092333-fceratto.json * 09:20 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77106 and previous config saved to /var/cache/conftool/dbconfig/20250605-092010-root.json * 09:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet * 
09:17 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping2004.codfw.wmnet * 09:13 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping2004.codfw.wmnet * 09:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77105 and previous config saved to /var/cache/conftool/dbconfig/20250605-090825-fceratto.json * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:07 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:05 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77104 and previous config saved to /var/cache/conftool/dbconfig/20250605-090504-root.json * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2155 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77103 and previous config saved to /var/cache/conftool/dbconfig/20250605-085847-fceratto.json * 08:58 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2155.codfw.wmnet with reason: Maintenance * 08:58 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77102 and previous config saved to /var/cache/conftool/dbconfig/20250605-085820-fceratto.json * 08:53 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet * 08:50 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet * 08:50 marostegui@cumin1002: dbctl commit (dc=all): 'db2169 (re)pooling 
@ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77101 and previous config saved to /var/cache/conftool/dbconfig/20250605-084959-root.json * 08:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ping1004.eqiad.wmnet * 08:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2169.codfw.wmnet with reason: Maintenance * 08:45 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2169 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77100 and previous config saved to /var/cache/conftool/dbconfig/20250605-084557-marostegui.json * 08:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet * 08:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ping1004.eqiad.wmnet * 08:43 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77099 and previous config saved to /var/cache/conftool/dbconfig/20250605-084313-fceratto.json * 08:35 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:34 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:32 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 08:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet * 08:28 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147', diff saved to https://phabricator.wikimedia.org/P77098 and previous config saved to /var/cache/conftool/dbconfig/20250605-082806-fceratto.json * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow7002.magru.wmnet * 08:24 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host netflow7002.magru.wmnet with OS bookworm * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:24 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77097 and previous config saved to /var/cache/conftool/dbconfig/20250605-081258-fceratto.json * 08:10 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet * 08:09 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet * 08:06 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on netflow7002.magru.wmnet with reason: host reimage * 08:03 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet * 08:03 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2147 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77096 and previous config saved to /var/cache/conftool/dbconfig/20250605-080310-fceratto.json * 08:03 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 08:00 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 07:56 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet * 07:50 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet * 07:44 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:43 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet * 07:38 gkyziridis@deploy1003: Sync cancelled. 
* 07:36 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host netflow7002.magru.wmnet with OS bookworm * 07:36 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet * 07:32 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:27 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache netflow7002.magru.wmnet on all recursors * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 07:26 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow7002.magru.wmnet - jmm@cumin1003" * 07:26 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet * 07:23 gkyziridis@deploy1003: gkyziridis: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:22 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:22 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host netflow7002.magru.wmnet * 07:21 gkyziridis@deploy1003: Started scap sync-world: Backport for [[gerrit:1152682{{!}}ores-extension: enable extension with revertrisk filter for second batch of wikis (T395823)]] * 07:19 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77095 and previous config saved to /var/cache/conftool/dbconfig/20250605-064137-root.json * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77094 and previous config saved to /var/cache/conftool/dbconfig/20250605-062629-root.json * 06:19 marostegui: Change datadir on pc8 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:19 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77093 and previous config saved to /var/cache/conftool/dbconfig/20250605-061929-marostegui.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc8 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77092 and previous config saved to /var/cache/conftool/dbconfig/20250605-061612-marostegui.json * 06:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2018.codfw.wmnet,pc1018.eqiad.wmnet with reason: Maintenance * 06:15 marostegui: Change datadir on pc7 dbmaint eqiad codfw [[phab:T395983|T395983]] * 06:15 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77091 and previous config saved to /var/cache/conftool/dbconfig/20250605-061502-marostegui.json * 06:12 marostegui@cumin1002: dbctl commit 
(dc=all): 'Depool pc7 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77090 and previous config saved to /var/cache/conftool/dbconfig/20250605-061200-marostegui.json * 06:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2017.codfw.wmnet,pc1017.eqiad.wmnet with reason: Maintenance * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77089 and previous config saved to /var/cache/conftool/dbconfig/20250605-061124-root.json * 05:57 marostegui: Change datadir on pc6 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77088 and previous config saved to /var/cache/conftool/dbconfig/20250605-055655-marostegui.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77087 and previous config saved to /var/cache/conftool/dbconfig/20250605-055619-root.json * 05:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2016.codfw.wmnet,pc1016.eqiad.wmnet with reason: Maintenance * 05:54 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc6 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77086 and previous config saved to /var/cache/conftool/dbconfig/20250605-055438-marostegui.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77085 and previous config saved to /var/cache/conftool/dbconfig/20250605-055349-marostegui.json * 05:51 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance * 05:51 marostegui: Change datadir on pc5 dbmaint 
eqiad codfw [[phab:T395983|T395983]] * 05:51 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc5 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77084 and previous config saved to /var/cache/conftool/dbconfig/20250605-055121-marostegui.json * 05:50 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77083 and previous config saved to /var/cache/conftool/dbconfig/20250605-055013-marostegui.json * 05:48 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance * 05:48 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc4 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77082 and previous config saved to /var/cache/conftool/dbconfig/20250605-054806-marostegui.json * 05:47 marostegui: Change datadir on pc4 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:47 marostegui: Change datadir on pc3 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77081 and previous config saved to /var/cache/conftool/dbconfig/20250605-054113-root.json * 05:36 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77080 and previous config saved to /var/cache/conftool/dbconfig/20250605-053647-marostegui.json * 05:33 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance * 05:33 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc3 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77079 and previous config saved to /var/cache/conftool/dbconfig/20250605-053317-marostegui.json * 05:29 marostegui@cumin1002: dbctl commit 
(dc=all): 'Repool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77078 and previous config saved to /var/cache/conftool/dbconfig/20250605-052934-marostegui.json * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'db2180 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77077 and previous config saved to /var/cache/conftool/dbconfig/20250605-052604-root.json * 05:25 marostegui: Change datadir on pc2 dbmaint eqiad codfw [[phab:T395983|T395983]] * 05:25 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2012.codfw.wmnet,pc1012.eqiad.wmnet with reason: Maintenance * 05:24 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc2 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77076 and previous config saved to /var/cache/conftool/dbconfig/20250605-052442-marostegui.json * 05:20 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2180.codfw.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2180 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77075 and previous config saved to /var/cache/conftool/dbconfig/20250605-052003-marostegui.json
== 2025-06-04 ==
* 23:55 brett@cumin2002: END (PASS) - Cookbook sre.cdn.roll-restart-reboot-ncredir (exit_code=0) rolling reboot on A:ncredir * 22:45 brett@cumin2002: START - Cookbook sre.cdn.roll-restart-reboot-ncredir rolling reboot on A:ncredir * 22:30 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7016.magru.wmnet * 22:27 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 22:20 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7014.magru.wmnet * 22:18 damilare: SmashPig upgraded from
{{Gerrit|d08693e5}} to {{Gerrit|3222a1f3}} * 22:16 ladsgroup@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] (duration: 13m 54s) * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7016.magru.wmnet * 22:12 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:12 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:11 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7015.magru.wmnet * 22:11 brett: sudo -i cumin 'A:ncredir' 'depool && apt-get update && apt-get upgrade -y && pool' -b1 -s10 * 22:09 ladsgroup@deploy1003: ladsgroup: Continuing with sync * 22:04 ladsgroup@deploy1003: ladsgroup: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 22:02 ladsgroup@deploy1003: Started scap sync-world: Backport for [[gerrit:1153725{{!}}Bump cache key version in EventStore (T396075)]] * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7015.magru.wmnet * 22:02 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7014.magru.wmnet * 22:02 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7013.magru.wmnet * 21:58 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7012.magru.wmnet * 21:43 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7013.magru.wmnet * 21:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7011.magru.wmnet * 21:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7012.magru.wmnet * 21:40 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:39 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:35 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7010.magru.wmnet * 21:29 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/scholarly/20250526/ using stat1011.eqiad.wmnet) * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7011.magru.wmnet * 21:25 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7010.magru.wmnet * 21:24 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on 
wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/main/20250526/ using stat1009.eqiad.wmnet) * 21:22 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7009.magru.wmnet * 21:14 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7008.magru.wmnet * 21:07 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 21:06 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7009.magru.wmnet * 21:05 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7008.magru.wmnet * 21:04 cjming: end of UTC late backport window * 21:04 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7007.magru.wmnet * 21:02 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] (d * 21:01 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 20:55 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 20:54 cjming@deploy1003: matmarex, cjming: Backport for 
[[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] synced to * 20:51 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153689{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153690{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]], [[gerrit:1153691{{!}}SUL3: Retry local login on failure due to invalid/expired login token (T390784)]], [[gerrit:1153692{{!}}SUL3: Retry local login on failure… (follow-ups) (T390784)]] * 20:51 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 20:50 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7006.magru.wmnet * 20:46 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7007.magru.wmnet * 20:44 robh@cumin2002: END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts cp7005.magru.wmnet * 20:40 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7006.magru.wmnet * 20:38 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] (duration: 15m 37s) * 20:37 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7004.magru.wmnet * 20:31 cjming@deploy1003: cjming, matmarex: Continuing with sync * 20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7005.magru.wmnet * 
20:26 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7004.magru.wmnet * 20:25 cjming@deploy1003: cjming, matmarex: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:23 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153686{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]], [[gerrit:1153687{{!}}Treat File::getShortDesc() as possibly unsafe HTML (T395834)]] * 20:20 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7003.magru.wmnet * 20:18 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7002.magru.wmnet * 20:15 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] (duration: 10m 13s) * 20:10 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7003.magru.wmnet * 20:09 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7002.magru.wmnet * 20:08 cjming@deploy1003: lucaswerkmeister, cjming: Continuing with sync * 20:07 cjming@deploy1003: lucaswerkmeister, cjming: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:05 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153673{{!}}beta cluster: Set $wgOATHAuthAccountPrefix (T396061)]] * 20:03 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp3072.esams.wmnet * 19:54 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp3072.esams.wmnet * 19:42 robh@cumin2002: END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts cp7001.magru.wmnet * 19:36 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:31 robh@cumin2002: START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cp7001.magru.wmnet * 19:23 bking@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 19:22 bking@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. * 19:13 sukhe@dns1004: END - running authdns-update * 19:12 sukhe@dns1004: START - running authdns-update * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.wikimedia.org [reason: repooling after reboot] * 19:11 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=dns7002.magru.wmnet [reason: repooling after reboot] * 19:11 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for dns7002.wikimedia.org * 19:11 sukhe@cumin1002: START - Cookbook sre.hosts.remove-downtime for dns7002.wikimedia.org * 19:10 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-reboot (exit_code=0) rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 19:10 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7002.wikimedia.org * 19:10 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 19:10 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 19:10 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 19:10 
dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] (duration: 12m 27s) * 19:10 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 19:09 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 19:08 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 19:03 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 19:00 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 19:00 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 18:59 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 18:59 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 18:57 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153679{{!}}CustomBlockedDomainStorage::fetchConfig: Cast LinkTarget to a Title for RevisionStore (T396056)]] * 18:55 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7002.wikimedia.org * 18:51 brett@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.* * 18:49 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 18:45 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot finished rebooting dns7001.wikimedia.org * 18:37 brett@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.* * 18:37 brett: depooling cp7001 for CPU stress testing and temperature effects ([[phab:T373993|T373993]]) * 18:29 sukhe@cumin1002: cookbooks.sre.dns.roll-reboot begin reboot of dns7001.wikimedia.org * 18:29 sukhe@cumin1002: START - Cookbook sre.dns.roll-reboot rolling reboot on A:dnsbox and A:magru and (A:dnsbox) * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:24 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 18:18 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 18:16 damilare: SmashPig upgraded from {{Gerrit|a99f2265}} to {{Gerrit|d08693e5}} * 18:15 sukhe: puppet re-enabled on A:cp and finished rolling out removal of ats-be from single backend cp nodes: [[phab:T288106|T288106]] * 18:14 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 18:13 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 18:03 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] (duration: 10m 05s) * 17:56 bvibber@deploy1003: bvibber: Continuing with sync * 17:55 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 17:53 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153662{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]], [[gerrit:1153663{{!}}Update Charts so they can render from data-mw-charts as well as data-charts (T395462)]] * 17:50 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: sync * 17:49 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: sync * 17:48 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:48 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:36 btullis@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. * 17:35 btullis@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
* 17:15 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] (duration: 02m 39s) * 17:13 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153647}}: mediawiki: Fix captcha configmap structure - [[phab:T388531|T388531]] * 16:41 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1185.eqiad.wmnet with OS bullseye * 16:33 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:32 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:30 vriley@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1186 * 16:30 vriley@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host an-worker1186 * 16:27 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:27 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 16:22 vriley@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:22 vriley@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt an-worker1186 - vriley@cumin1002" * 16:19 stevemunene@cumin1002: END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker1164.eqiad.wmnet * 16:18 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1165.eqiad.wmnet * 16:16 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1165.eqiad.wmnet * 16:14 vriley@cumin1002: START - Cookbook sre.dns.netbox * 16:14 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:13 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:13 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:12 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1164.eqiad.wmnet * 16:12 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-durum (exit_code=0) rolling reboot on A:durum and not A:magru and A:durum * 16:12 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:11 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling reboot on A:wikidough and not A:magru and A:wikidough * 16:10 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1164.eqiad.wmnet * 16:07 stevemunene@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1163.eqiad.wmnet * 16:04 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 16:04 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 16:03 stevemunene@cumin1002: START - Cookbook sre.hosts.reboot-single for host an-worker1163.eqiad.wmnet * 16:01 fceratto@cumin1002: dbctl commit (dc=all): 
'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77074 and previous config saved to /var/cache/conftool/dbconfig/20250604-160120-fceratto.json * 15:56 vriley@cumin1002: END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:56 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1186.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:46 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77073 and previous config saved to /var/cache/conftool/dbconfig/20250604-154611-fceratto.json * 15:42 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet * 15:31 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P77072 and previous config saved to /var/cache/conftool/dbconfig/20250604-153104-fceratto.json * 15:30 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet * 15:29 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet * 15:24 dreamyjazz@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] (duration: 10m 03s) * 15:20 vriley@cumin1002: START - Cookbook sre.hosts.reimage for host an-worker1185.eqiad.wmnet with OS bullseye * 15:20 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet * 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 15:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 15:18 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 15:17 dreamyjazz@deploy1003: dreamyjazz: Continuing with sync * 15:16 dreamyjazz@deploy1003: dreamyjazz: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77071 and previous config saved to /var/cache/conftool/dbconfig/20250604-151556-fceratto.json * 15:15 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-durum rolling reboot on A:durum and not A:magru and A:durum * 15:14 dreamyjazz@deploy1003: Started scap sync-world: Backport for [[gerrit:1153646{{!}}Set wgCheckUserDisableCheckUserAPI to false on loginwiki (T396010)]] * 15:14 sukhe@cumin1002: START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling reboot on A:wikidough and not A:magru and A:wikidough * 15:14 vriley@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 15:08 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1226 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77070 and previous config saved to /var/cache/conftool/dbconfig/20250604-150740-fceratto.json * 15:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Maintenance * 15:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 
([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77069 and previous config saved to /var/cache/conftool/dbconfig/20250604-150716-fceratto.json * 15:05 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop (duration: 02m 52s) * 15:04 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:04 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 15:02 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: Chart bump, noop * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 cmooney@cumin1002: END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) * 15:00 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 15:00 stevemunene@cumin1002: END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1163.eqiad.wmnet * 14:58 stevemunene@cumin1002: START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1163.eqiad.wmnet * 14:58 moritzm: installing Linux 5.10.237 on Bullseye hosts * 14:55 cmooney@dns2005: END - running authdns-update * 14:54 cmooney@dns2005: START - running authdns-update * 14:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77068 and previous config saved to /var/cache/conftool/dbconfig/20250604-145209-fceratto.json * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:50 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: remove DNS entries for IPs used in Nokia test lab codfw - cmooney@cumin1002" * 14:49 
jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1094.eqiad.wmnet with OS bullseye * 14:49 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:49 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet * 14:47 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet * 14:46 vriley@cumin1002: START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 14:46 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:46 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1095.eqiad.wmnet with OS bullseye * 14:46 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:44 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 14:43 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 14:41 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet * 14:38 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:37 cgoubert@deploy1003: helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply * 14:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P77067 and previous config saved to /var/cache/conftool/dbconfig/20250604-143702-fceratto.json * 14:36 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 33s) * 14:33 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:32 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:32 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:31 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet * 14:31 cgoubert@deploy1003: Finished scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] (duration: 02m 24s) * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet * 14:29 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet * 14:28 cgoubert@deploy1003: Started scap sync-world: {{Gerrit|1153634}}: mediawiki: Fix captcha wordlists path - [[phab:T388531|T388531]] * 14:28 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 14:27 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 14:26 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 14:25 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 14:24 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
* 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts doh7001.wikimedia.org * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:24 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: doh7001.wikimedia.org decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:23 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet * 14:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77066 and previous config saved to /var/cache/conftool/dbconfig/20250604-142155-fceratto.json * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 14:21 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
* 14:20 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts durum7001.magru.wmnet * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:19 sukhe@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 sukhe@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: durum7001.magru.wmnet decommissioned, removing all IPs except the asset tag one - sukhe@cumin1002" * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 14:19 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:17 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet * 14:14 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 14:13 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet * 14:12 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1214 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77065 and previous config saved to /var/cache/conftool/dbconfig/20250604-141238-fceratto.json * 14:12 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance * 14:12 
fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77064 and previous config saved to /var/cache/conftool/dbconfig/20250604-141213-fceratto.json * 14:11 sukhe@cumin1002: START - Cookbook sre.dns.netbox * 14:11 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet * 14:08 sukhe: decommissioning doh7001 and durum7001: [[phab:T396015|T396015]] * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts doh7001.wikimedia.org * 14:07 jforrester@deploy1003: helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply * 14:07 jforrester@deploy1003: helmfile [codfw] START helmfile.d/services/wikifunctions: apply * 14:07 sukhe@cumin1002: START - Cookbook sre.hosts.decommission for hosts durum7001.magru.wmnet * 14:06 jforrester@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply * 14:06 jforrester@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifunctions: apply * 14:05 jforrester@deploy1003: helmfile [staging] START helmfile.d/services/wikifunctions: apply * 14:04 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:04 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 14:02 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 14:02 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 14:01 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:58 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77062 and previous config saved to /var/cache/conftool/dbconfig/20250604-135706-fceratto.json * 13:56 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1094.eqiad.wmnet with reason: host reimage * 13:54 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol2010-dev.codfw.wmnet with reason: host reimage * 13:51 claime: Manual run of generatecaptcha on mw-cron, no delete - [[phab:T388531|T388531]] * 13:51 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 13:50 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 13:49 sukhe: sudo cumin -b1 -s15 'A:cp' 'run-puppet-agent --enable "merging CR 1114074"': [[phab:T288106|T288106]] * 13:48 vgutierrez@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:48 vgutierrez@cumin1002: START - Cookbook sre.loadbalancer.admin config_reloading P<nowiki>{</nowiki>lvs1013.eqiad.wmnet<nowiki>}</nowiki> and A:liberica * 13:47 tappof@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet * 13:46 sukhe: forcing ats-backend-restart on cp1104 * 13:43 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77061 and previous config saved to /var/cache/conftool/dbconfig/20250604-134336-root.json * 13:43 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1209', diff saved to https://phabricator.wikimedia.org/P77060 and previous config saved to /var/cache/conftool/dbconfig/20250604-134158-fceratto.json * 13:41 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:41 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:40 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:40 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] (duration: 09m 57s) * 13:39 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1095.eqiad.wmnet with reason: host reimage * 13:38 tappof@cumin1002: END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host grafana2001.codfw.wmnet * 13:38 tappof@cumin1002: START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet * 13:37 sukhe: forcing agent run on cp2037 (non-single BE node): CR {{Gerrit|1114074}} * 13:37 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1094.eqiad.wmnet with OS bullseye * 13:33 samtar@deploy1003: samtar: Continuing with sync * 13:32 samtar@deploy1003: samtar: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:30 sukhe: forcing agent run on cp7001 (single BE node): CR {{Gerrit|1114074}} * 13:30 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153623{{!}}IS: Undo turning on wgTemplateDataEnableCategoryBrowser for mw.org (T377975)]] * 13:29 sukhe: forcing agent run on cp6015: CR {{Gerrit|1114074}} * 13:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77059 and previous config saved to /var/cache/conftool/dbconfig/20250604-132829-root.json * 13:27 vgutierrez@cumin1002: END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1013.eqiad.wmnet * 13:27 vgutierrez@cumin1002: START - Cookbook sre.hosts.remove-downtime for lvs1013.eqiad.wmnet * 13:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77058 and previous config saved to /var/cache/conftool/dbconfig/20250604-132648-fceratto.json * 13:23 sukhe: starting removal of ats-be service from eqiad, eqsin, esams, magru, ulsfo: [[phab:T288106|T288106]] * 13:21 sukhe: sudo cumin 'A:cp' 'disable-puppet "merging CR 1114074"' * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host ms-be1095.eqiad.wmnet with OS bullseye * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1209 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77057 and previous config saved to /var/cache/conftool/dbconfig/20250604-131852-fceratto.json * 13:18 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance * 13:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77056 and previous config saved to /var/cache/conftool/dbconfig/20250604-131827-fceratto.json * 13:14 jforrester@deploy1003: 
Finished scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] (duration: 10m 29s) * 13:13 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77055 and previous config saved to /var/cache/conftool/dbconfig/20250604-131323-root.json * 13:11 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply * 13:11 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply * 13:08 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 13:07 jforrester@deploy1003: jforrester, mhorsey: Continuing with sync * 13:06 jforrester@deploy1003: jforrester, mhorsey: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] synced to the testservers (see https://wikitech.wikimedia * 13:04 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:04 jforrester@deploy1003: Started scap sync-world: Backport for [[gerrit:1146628{{!}}release CampaignEvents to cbk-zam wiki (T393604)]], [[gerrit:1153385{{!}}Bump portals to the 2025-06-02 09:23:11+00:00 build (T128546)]], [[gerrit:1151781{{!}}build: Rename the rarely-used 'typos' script to 'checkTypos']], [[gerrit:1151751{{!}}Drop Chart roll-out dblists, no longer needed (T383079)]] * 13:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1203', diff saved to https://phabricator.wikimedia.org/P77054 and previous config saved to /var/cache/conftool/dbconfig/20250604-130319-fceratto.json * 13:03 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 13:02 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 13:02 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 13:01 sbassett@deploy1003: helmfile [codfw] START helmfile.d/services/miscweb: apply * 13:01 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 12:58 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77053 and previous config saved to /var/cache/conftool/dbconfig/20250604-125817-root.json * 12:43 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77051 and previous config saved to /var/cache/conftool/dbconfig/20250604-124311-root.json * 12:43 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1004.eqiad.wmnet * 12:42 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1003.eqiad.wmnet * 12:39 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply * 12:39 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 12:39 jiji@deploy1003: helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply * 12:37 jclark@cumin1002: START - Cookbook sre.hosts.provision for host ms-be1095.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:36 moritzm: installing modsecurity-apache security updates * 12:36 jclark@cumin1002: 
START - Cookbook sre.hosts.provision for host ms-be1094.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 12:35 andrew@cumin1002: START - Cookbook sre.hosts.reboot-single for host cloudgw1003.eqiad.wmnet * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:35 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:35 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for ms-be1094/95 - jclark@cumin1002" * 12:34 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77050 and previous config saved to /var/cache/conftool/dbconfig/20250604-123304-fceratto.json * 12:32 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77049 and previous config saved to /var/cache/conftool/dbconfig/20250604-122948-root.json * {{safesubst:SAL entry|1=12:28 reedy@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were rea}} * 12:28 marostegui@cumin1002: dbctl commit (dc=all): 'db2217 (re)pooling @ 10%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P77048 and previous config saved to /var/cache/conftool/dbconfig/20250604-122806-root.json * 12:27 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:27 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:27 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcontrol2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:26 cmooney@cumin1002: START - Cookbook sre.dns.wipe-cache cloudcephosd2010-dev.codfw.wmnet on all recursors * 12:25 jclark@cumin1002: START - Cookbook sre.dns.netbox * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:25 cmooney@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:25 cmooney@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: fix entries for cloudcontrol2010-dev which had been added on wrong vlan - cmooney@cumin1002" * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1203 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77047 and previous config saved to /var/cache/conftool/dbconfig/20250604-122436-fceratto.json * 12:24 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance * 12:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77046 and previous config saved to /var/cache/conftool/dbconfig/20250604-122411-fceratto.json * 12:23 marostegui@cumin1002: 
dbctl commit (dc=all): 'Depool db2217 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77045 and previous config saved to /var/cache/conftool/dbconfig/20250604-122303-marostegui.json * 12:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2217.codfw.wmnet with reason: Maintenance * 12:21 cmooney@cumin1002: START - Cookbook sre.dns.netbox * 12:21 reedy@deploy1003: reedy: Continuing with sync * 12:21 cmooney@cumin1002: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * {{safesubst:SAL entry|1=12:20 reedy@deploy1003: reedy: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read from wordlist (T3}} * 12:18 cmooney@cumin1002: START - Cookbook sre.dns.netbox * {{safesubst:SAL entry|1=12:18 reedy@deploy1003: Started scap sync-world: Backport for [[gerrit:1153591{{!}}GenerateFancyCaptchas: Handle captcha.py not generating any captchas, but not erroring (T388531)]], [[gerrit:1153592{{!}}captcha.py: Expand variables and user in filenames (T395810)]], [[gerrit:1153593{{!}}captcha.py: Check if output dir exists, and attempt to create it (else error) (T395804)]], [[gerrit:1153595{{!}}captcha.py: Bail out if no words were read}} * 12:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 12:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77044 and previous config saved to /var/cache/conftool/dbconfig/20250604-121442-root.json * 12:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on 
ganeti7001.magru.wmnet with reason: host reimage * 12:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77043 and previous config saved to /var/cache/conftool/dbconfig/20250604-120904-fceratto.json * 11:59 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77041 and previous config saved to /var/cache/conftool/dbconfig/20250604-115936-root.json * 11:58 samtar@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] (duration: 12m 28s) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts krb1001.eqiad.wmnet * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:55 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:53 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P77040 and previous config saved to /var/cache/conftool/dbconfig/20250604-115357-fceratto.json * 11:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: krb1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003" * 11:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 11:51 samtar@deploy1003: samtar: Continuing with sync * 11:47 samtar@deploy1003: samtar: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. * 11:45 samtar@deploy1003: Started scap sync-world: Backport for [[gerrit:1153581{{!}}IS/IS-labs: Enable TemplateDiscovery flags for mediawikiwiki (T377975)]] * 11:44 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77038 and previous config saved to /var/cache/conftool/dbconfig/20250604-114430-root.json * 11:38 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77037 and previous config saved to /var/cache/conftool/dbconfig/20250604-113849-fceratto.json * 11:38 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:35 mvolz@deploy1003: helmfile [eqiad] DONE helmfile.d/services/citoid: apply * 11:35 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . * 11:35 mvolz@deploy1003: helmfile [eqiad] START helmfile.d/services/citoid: apply * 11:34 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet * 11:34 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet * 11:33 mvolz@deploy1003: helmfile [codfw] DONE helmfile.d/services/citoid: apply * 11:32 mvolz@deploy1003: helmfile [codfw] START helmfile.d/services/citoid: apply * 11:32 jmm@cumin1003: START - Cookbook sre.hosts.decommission for hosts krb1001.eqiad.wmnet * 11:31 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:31 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1192 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77036 and previous config saved to /var/cache/conftool/dbconfig/20250604-113030-fceratto.json * 11:30 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance * 11:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77035 and previous config saved to /var/cache/conftool/dbconfig/20250604-113005-fceratto.json * 11:29 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77034 and previous config saved to /var/cache/conftool/dbconfig/20250604-112923-root.json * 11:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet * 11:25 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:25 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] DONE helmfile.d/services/citoid: apply * 11:22 mvolz@deploy1003: helmfile [staging] START helmfile.d/services/citoid: apply * 11:20 
jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet * 11:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77033 and previous config saved to /var/cache/conftool/dbconfig/20250604-111457-fceratto.json * 11:14 claime: Deployed k8s-controller-sidecars version 1.0.2-3 * 11:14 marostegui@cumin1002: dbctl commit (dc=all): 'db2193 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77032 and previous config saved to /var/cache/conftool/dbconfig/20250604-111418-root.json * 11:10 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2193.codfw.wmnet with reason: Maintenance * 11:09 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2193 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P77031 and previous config saved to /var/cache/conftool/dbconfig/20250604-110955-marostegui.json * 11:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 11:06 cgoubert@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 11:05 cgoubert@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 11:04 cgoubert@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 11:03 cgoubert@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
* 10:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P77030 and previous config saved to /var/cache/conftool/dbconfig/20250604-105950-fceratto.json * 10:52 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:51 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:45 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 10:44 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 10:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77029 and previous config saved to /var/cache/conftool/dbconfig/20250604-104443-fceratto.json * 10:38 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:37 jmm@cumin1003: START - Cookbook sre.ganeti.addnode for new host ganeti7001.magru.wmnet to cluster magru03 and group B * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1178 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77028 and previous config saved to /var/cache/conftool/dbconfig/20250604-103629-fceratto.json * 10:36 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance * 10:36 moritzm: failover ganeti master in ulsfo to ganeti4005 * 10:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77027 and previous config saved to /var/cache/conftool/dbconfig/20250604-103604-fceratto.json * 10:35 jmm@cumin1003: END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4007.ulsfo.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7001.magru.wmnet * 10:35 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet * 10:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77026 and previous config saved to /var/cache/conftool/dbconfig/20250604-103233-root.json * 10:29 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4007.ulsfo.wmnet * 10:25 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7001.magru.wmnet * 10:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77025 and previous config saved to /var/cache/conftool/dbconfig/20250604-102351-root.json * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4006.ulsfo.wmnet * 10:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4006.ulsfo.wmnet * 10:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77024 and previous config saved to /var/cache/conftool/dbconfig/20250604-102056-fceratto.json * 10:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77023 and previous config saved to /var/cache/conftool/dbconfig/20250604-101728-root.json * 10:17 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4006.ulsfo.wmnet * 10:09 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4006.ulsfo.wmnet * 
10:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77022 and previous config saved to /var/cache/conftool/dbconfig/20250604-100846-root.json * 10:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P77021 and previous config saved to /var/cache/conftool/dbconfig/20250604-100547-fceratto.json * 10:05 vgutierrez@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs1013.eqiad.wmnet with reason: switching to katran * 10:04 vgutierrez: upload liberica 0.15 to bookworm-wikimedia (apt.wm.o) - [[phab:T395228|T395228]] * 10:02 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77020 and previous config saved to /var/cache/conftool/dbconfig/20250604-100222-root.json * 10:00 vgutierrez: depool lvs1013 before switching to katran - [[phab:T395228|T395228]] * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:59 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:55 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 09:55 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 09:53 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77018 and previous config saved to /var/cache/conftool/dbconfig/20250604-095340-root.json * 09:52 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:52 akosiaris: re-deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. 
[[phab:T395451|T395451]] * 09:52 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77017 and previous config saved to /var/cache/conftool/dbconfig/20250604-095041-fceratto.json * 09:47 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77016 and previous config saved to /var/cache/conftool/dbconfig/20250604-094715-root.json * 09:46 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply * 09:46 akosiaris: [[phab:T395451|T395451]] deploy mw-jobrunner hot patch for VirtualHost selection, testing out that the single version change will work this time around. * 09:46 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply * 09:42 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1177 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77015 and previous config saved to /var/cache/conftool/dbconfig/20250604-094217-fceratto.json * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4005.ulsfo.wmnet * 09:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance * 09:42 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4005.ulsfo.wmnet * 09:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P77014 and previous config saved to /var/cache/conftool/dbconfig/20250604-094152-fceratto.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti7001.magru.wmnet with OS bookworm * 09:38 
marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P77013 and previous config saved to /var/cache/conftool/dbconfig/20250604-093835-root.json * 09:37 hnowlan@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 09:37 hnowlan@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 09:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti4005.ulsfo.wmnet * 09:33 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4005.ulsfo.wmnet * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P77011 and previous config saved to /var/cache/conftool/dbconfig/20250604-093251-root.json * 09:32 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77010 and previous config saved to /var/cache/conftool/dbconfig/20250604-093209-root.json * 09:28 marostegui@cumin1002: dbctl commit (dc=all): 'Repool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77009 and previous config saved to /var/cache/conftool/dbconfig/20250604-092819-marostegui.json * 09:27 marostegui: Move datadir on pc1011 dbmaint pc1 eqiad [[phab:T395983|T395983]] * 09:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77008 and previous config saved to /var/cache/conftool/dbconfig/20250604-092645-fceratto.json * 09:24 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=x3 * 09:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P77007 and previous config saved to /var/cache/conftool/dbconfig/20250604-092328-root.json * 09:20 marostegui: Move datadir on pc2011 
dbmaint pc1 codfw [[phab:T395983|T395983]] * 09:19 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2011.codfw.wmnet,pc1011.eqiad.wmnet with reason: Maintenance * 09:19 marostegui@cumin1002: dbctl commit (dc=all): 'Depool pc1 [[phab:T395983|T395983]]', diff saved to https://phabricator.wikimedia.org/P77006 and previous config saved to /var/cache/conftool/dbconfig/20250604-091921-marostegui.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P77005 and previous config saved to /var/cache/conftool/dbconfig/20250604-091745-root.json * 09:17 marostegui@cumin1002: dbctl commit (dc=all): 'es2044 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77004 and previous config saved to /var/cache/conftool/dbconfig/20250604-091704-root.json * 09:16 hnowlan@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 09:16 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:16 hnowlan@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 09:15 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 09:15 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 404s. * 09:14 akosiaris: [[phab:T395451|T395451]] rollback the host header addition, this is erroring out, returning 3xx. 
* 09:14 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:14 jmm@deploy1003: helmfile [eqiad] DONE helmfile.d/services/thumbor: apply * 09:14 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:12 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti7001.magru.wmnet with reason: host reimage * 09:11 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P77003 and previous config saved to /var/cache/conftool/dbconfig/20250604-091138-fceratto.json * 09:11 jmm@deploy1003: helmfile [eqiad] START helmfile.d/services/thumbor: apply * 09:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2044.codfw.wmnet with reason: Maintenance * 09:10 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2044', diff saved to https://phabricator.wikimedia.org/P77002 and previous config saved to /var/cache/conftool/dbconfig/20250604-091041-marostegui.json * 09:10 jmm@deploy1003: helmfile [codfw] DONE helmfile.d/services/thumbor: apply * 09:10 moritzm: installing qemu bugfix updates from Bookworm point release * 09:08 ayounsi@cumin1002: END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-f1-codfw * 09:08 marostegui@cumin1002: dbctl commit (dc=all): 'es2041 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P77001 and previous config saved to /var/cache/conftool/dbconfig/20250604-090823-root.json * 09:06 ayounsi@cumin1002: START - Cookbook sre.network.tls for network device lsw1-f1-codfw * 09:05 jmm@deploy1003: helmfile [codfw] START helmfile.d/services/thumbor: apply * 09:03 akosiaris@deploy1003: helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 akosiaris@deploy1003: helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: dbctl commit 
(dc=all): 'db2224 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P77000 and previous config saved to /var/cache/conftool/dbconfig/20250604-090240-root.json * 09:02 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2041', diff saved to https://phabricator.wikimedia.org/P76999 and previous config saved to /var/cache/conftool/dbconfig/20250604-090226-marostegui.json * 09:02 akosiaris@deploy1003: helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply * 09:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2041.codfw.wmnet with reason: Maintenance * 09:01 akosiaris@deploy1003: helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply * 08:57 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:57 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76998 and previous config saved to /var/cache/conftool/dbconfig/20250604-085630-fceratto.json * 08:53 jmm@deploy1003: helmfile [staging] DONE helmfile.d/services/thumbor: apply * 08:53 jmm@deploy1003: helmfile [staging] START helmfile.d/services/thumbor: apply * 08:51 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ganeti7001.magru.wmnet with OS bookworm * 08:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1172 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76997 and previous config saved to /var/cache/conftool/dbconfig/20250604-084819-fceratto.json * 08:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance * 08:47 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76996 and previous 
config saved to /var/cache/conftool/dbconfig/20250604-084735-root.json * 08:42 akosiaris: deploy changeprop-jobqueue to set the Host HTTP header for submission of all jobs. [[phab:T395451|T395451]] * 08:42 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance * 08:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76995 and previous config saved to /var/cache/conftool/dbconfig/20250604-084231-fceratto.json * 08:42 akosiaris@deploy1003: helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply * 08:41 akosiaris@deploy1003: helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply * 08:38 moritzm: removing ganeti7001 from magru01 cluster [[phab:T394263|T394263]] * 08:38 marostegui: Change s6 eqiad dbmaint to SBR [[phab:T383795|T383795]] * 08:38 akosiaris: revoke and clean helm-charts.discovery.wmnet old cergen cert from puppetmaster1001 * 08:32 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76994 and previous config saved to /var/cache/conftool/dbconfig/20250604-083229-root.json * 08:28 marostegui: Change s6 codfw dbmaint to SBR [[phab:T383795|T383795]] * 08:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76993 and previous config saved to /var/cache/conftool/dbconfig/20250604-082725-fceratto.json * 08:17 marostegui@cumin1002: dbctl commit (dc=all): 'db2224 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76992 and previous config saved to /var/cache/conftool/dbconfig/20250604-081725-root.json * 08:14 moritzm: removing atlas7001 from magru01 cluster [[phab:T394263|T394263]] * 08:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after 
maintenance db1167', diff saved to https://phabricator.wikimedia.org/P76991 and previous config saved to /var/cache/conftool/dbconfig/20250604-081219-fceratto.json * 08:11 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2224.codfw.wmnet with reason: Maintenance * 08:11 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db2224 [[phab:T395989|T395989]]', diff saved to https://phabricator.wikimedia.org/P76990 and previous config saved to /var/cache/conftool/dbconfig/20250604-081058-marostegui.json * 08:05 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76989 and previous config saved to /var/cache/conftool/dbconfig/20250604-080546-root.json * 08:03 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt2002.wikimedia.org * 08:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt2002.wikimedia.org * 07:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76988 and previous config saved to /var/cache/conftool/dbconfig/20250604-075711-fceratto.json * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76987 and previous config saved to /var/cache/conftool/dbconfig/20250604-075041-root.json * 07:48 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76986 and previous config saved to /var/cache/conftool/dbconfig/20250604-074850-fceratto.json * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance * 07:48 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance * 07:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76985 and previous config saved to /var/cache/conftool/dbconfig/20250604-073921-root.json * 07:37 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of netflow7001.magru.wmnet to plain * 07:37 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of netflow7001.magru.wmnet to plain * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76984 and previous config saved to /var/cache/conftool/dbconfig/20250604-073535-root.json * 07:34 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of durum7001.magru.wmnet to plain * 07:31 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of durum7001.magru.wmnet to plain * 07:24 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76983 and previous config saved to /var/cache/conftool/dbconfig/20250604-072416-root.json * 07:23 Emperor: restart swift-object-replicator ms-be2066 * 07:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76982 and previous config saved to /var/cache/conftool/dbconfig/20250604-072030-root.json * 07:19 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of doh7001.wikimedia.org to plain * 07:18 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of 
doh7001.wikimedia.org to plain * 07:16 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of ncredir7001.magru.wmnet to plain * 07:13 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of ncredir7001.magru.wmnet to plain * 07:09 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.changedisk (exit_code=0) for changing disk type of install7001.wikimedia.org to plain * 07:09 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76981 and previous config saved to /var/cache/conftool/dbconfig/20250604-070910-root.json * 07:08 jmm@cumin1003: START - Cookbook sre.ganeti.changedisk for changing disk type of install7001.wikimedia.org to plain * 07:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76980 and previous config saved to /var/cache/conftool/dbconfig/20250604-070525-root.json * 07:01 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti7001.magru.wmnet * 06:57 jmm@cumin1003: START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti7001.magru.wmnet * 06:54 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76979 and previous config saved to /var/cache/conftool/dbconfig/20250604-065405-root.json * 06:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76978 and previous config saved to /var/cache/conftool/dbconfig/20250604-065020-root.json * 06:39 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76977 and previous config saved to /var/cache/conftool/dbconfig/20250604-063900-root.json * 06:35 marostegui@cumin1002: dbctl commit 
(dc=all): 'es1035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76976 and previous config saved to /var/cache/conftool/dbconfig/20250604-063515-root.json * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:33 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 06:31 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 52s) * 06:24 marostegui@deploy1003: marostegui: Continuing with sync * 06:24 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'es2048 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76974 and previous config saved to /var/cache/conftool/dbconfig/20250604-062355-root.json * 06:21 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153432{{!}}Revert "db-production.php: Disable writes on es7"]] * 06:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76973 and previous config saved to /var/cache/conftool/dbconfig/20250604-062010-root.json * 06:04 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1035 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76972 and previous config saved to /var/cache/conftool/dbconfig/20250604-060413-marostegui.json * 06:03 marostegui@dns1006: END - running authdns-update * 06:03 marostegui@dns1006: START - running authdns-update * 06:02 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1039 to es7 primary [[phab:T395982|T395982]]', diff 
saved to https://phabricator.wikimedia.org/P76971 and previous config saved to /var/cache/conftool/dbconfig/20250604-060246-marostegui.json * 06:02 marostegui: Starting es7 eqiad failover from es1035 to es1039 - [[phab:T395982|T395982]] * 05:57 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1039 with weight 0 [[phab:T395982|T395982]]', diff saved to https://phabricator.wikimedia.org/P76970 and previous config saved to /var/cache/conftool/dbconfig/20250604-055744-marostegui.json * 05:56 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] (duration: 13m 00s) * 05:49 marostegui@deploy1003: marostegui: Continuing with sync * 05:45 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 05:43 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153431{{!}}db-production.php: Disable writes on es7 (T395982)]] * 05:42 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395982|T395982]] * 00:38 eileen: civicrm upgraded from {{Gerrit|8eb67a94}} to {{Gerrit|22171c0b}} == 2025-06-03 == * 22:42 andrew@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 22:39 andrew@cumin1002: START - Cookbook sre.dns.netbox * 22:10 eileen: civicrm upgraded from {{Gerrit|3b59e784}} to {{Gerrit|8eb67a94}} * 22:09 sbassett@deploy1003: helmfile [eqiad] DONE helmfile.d/services/miscweb: apply * 21:59 sbassett@deploy1003: helmfile [eqiad] START helmfile.d/services/miscweb: apply * 21:53 tzatziki: removing 4 files for legal compliance * 21:48 sbassett@deploy1003: helmfile [codfw] DONE helmfile.d/services/miscweb: apply * 21:41 tzatziki: removing 2 files for legal compliance * 21:38 sbassett@deploy1003: helmfile 
[codfw] START helmfile.d/services/miscweb: apply * 21:38 sbassett@deploy1003: helmfile [staging] DONE helmfile.d/services/miscweb: apply * 21:28 sbassett@deploy1003: helmfile [staging] START helmfile.d/services/miscweb: apply * 21:21 mstyles@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] (duration: 11m 31s) * 21:18 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 21:14 mstyles@deploy1003: mstyles, sbassett: Continuing with sync * 21:11 mstyles@deploy1003: mstyles, sbassett: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 21:09 mstyles@deploy1003: Started scap sync-world: Backport for [[gerrit:1153351{{!}}Revert^2 "OATHAuth: Mark checkuser and suppress as requiring 2FA" (T150898)]] * 21:03 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] (duration: 09m 49s) * 20:56 cjming@deploy1003: matmarex, cjming: Continuing with sync * 20:55 cjming@deploy1003: matmarex, cjming: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:53 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1153350{{!}}Use default preference if no client preference in auth request (T395957)]] * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.clone (exit_code=99) of es2040.codfw.wmnet onto es2048.codfw.wmnet * 20:46 marostegui@cumin1002: END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 20:37 cscott@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] (duration: 12m 41s) * 20:31 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 20:30 cscott@deploy1003: cscott: Continuing with sync * 20:27 cscott@deploy1003: cscott: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:25 cscott@deploy1003: Started scap sync-world: Backport for [[gerrit:1153341{{!}}Use ::getContentId() and ::clearContentId() from the Parsoid extension API]] * 20:18 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] (duration: 11m 18s) * 20:11 cjming@deploy1003: ksarabia, cjming: Continuing with sync * 20:09 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 20:08 cjming@deploy1003: ksarabia, cjming: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:06 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152860{{!}}Deploy survey to en at twenty percent (T389393)]] * 19:37 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:37 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1008.eqiad.wmnet) * 19:36 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:36 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs1020.eqiad.wmnet from DumpsSource.HDFS 
(hdfs:///wmf/data/discovery/wikidata/rdf/date=20250526/wiki=wikidata/ using stat1010.eqiad.wmnet) * 19:35 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:35 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata on wdqs1020.eqiad.wmnet from DumpsSource.NFS (munging data to /srv/wdqs/munged, /srv/wdqs/lex-munged) * 19:22 dduvall@deploy1003: rebuilt and synchronized wikiversions files: group0 to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 19:11 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1011.eqiad.wmnet) * 19:08 ryankemper@cumin2002: END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:07 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading scholarly_articles on wdqs1023.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=scholarly_articles/ using stat1009.eqiad.wmnet) * 19:05 ebernhardson@deploy1003: Finished deploy [wdqs/wdqs@fea7794]: 0.3.157 (duration: 17m 57s) * 19:00 ryankemper@cumin2002: START - Cookbook sre.wdqs.data-reload reloading wikidata_main on wdqs1022.eqiad.wmnet from DumpsSource.HDFS (hdfs:///wmf/data/discovery/wikidata/rdf_subgraphs/snapshot=20250526/wiki=wikidata/scope=wikidata_main/ using stat1009.eqiad.wmnet) * 18:47 ebernhardson@deploy1003: Started deploy [wdqs/wdqs@fea7794]: 0.3.157 * 18:08 aokoth@cumin1002: END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host vrts1003.eqiad.wmnet * 18:00 aokoth@cumin1002: START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:50 gmodena@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 17:45 swfrench@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 17:42 isaranto@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:41 isaranto@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . * 17:39 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:39 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:39 isaranto@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
* 17:39 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:38 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:36 swfrench@deploy1003: Finished scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] (duration: 02m 10s) * 17:35 fnegri@cumin1002: conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 17:34 swfrench@deploy1003: Started scap sync-world: Scap test run after revert - [[phab:T389786|T389786]] * 17:19 swfrench@deploy1003: Started scap sync-world: Scap run to test newly enabled dse-k8s-eqiad deployment - [[phab:T388761|T388761]] [[phab:T389786|T389786]] * 17:17 bvibber@deploy1003: helmfile [codfw] DONE helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [codfw] START helmfile.d/services/chart-renderer: apply * 17:16 bvibber@deploy1003: helmfile [eqiad] DONE helmfile.d/services/chart-renderer: apply * 17:15 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:15 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:14 bvibber@deploy1003: helmfile [eqiad] START helmfile.d/services/chart-renderer: apply * 17:14 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 17:13 bvibber@deploy1003: helmfile [staging] DONE helmfile.d/services/chart-renderer: apply * 17:13 bvibber@deploy1003: helmfile [staging] START helmfile.d/services/chart-renderer: apply * 17:12 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 17:06 cdanis@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 17:01 cdanis@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:59 cdanis@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. 
* 16:57 fnegri@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on clouddb1019.eqiad.wmnet with reason: Debugging stuck queries [[phab:T390767|T390767]] * 16:42 bvibber@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] (duration: 09m 54s) * 16:38 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 16:35 bvibber@deploy1003: bvibber: Continuing with sync * 16:35 bvibber@deploy1003: bvibber: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 16:34 sukhe@dns1004: END - running authdns-update * 16:33 sukhe: testing dummy authdns-update to ensure clean run after gc-authdns-git-repo.timer run * 16:33 sukhe@dns1004: START - running authdns-update * 16:32 bvibber@deploy1003: Started scap sync-world: Backport for [[gerrit:1153282{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]], [[gerrit:1153281{{!}}Fixes: Charts embedded in template rendering in Parsoid (T395462)]] * 16:23 jiji@deploy1003: Finished scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs (duration: 02m 58s) * 16:22 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 16:21 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 16:20 jiji@deploy1003: Started scap sync-world: [[phab:T276994|T276994]]: We merged a number of noop patches, sparing deployers the scary diffs * 16:12 jiji@deploy1003: helmfile [eqiad] DONE
helmfile.d/services/mw-debug: apply * 16:12 jiji@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-debug: apply * 16:06 jiji@deploy1003: helmfile [codfw] DONE helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [codfw] START helmfile.d/admin 'apply'. * 16:04 jiji@deploy1003: helmfile [eqiad] DONE helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [eqiad] START helmfile.d/admin 'apply'. * 15:57 jiji@deploy1003: helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. * 15:55 jiji@deploy1003: helmfile [staging-eqiad] START helmfile.d/admin 'apply'. * 15:54 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:51 jiji@deploy1003: helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. * 15:50 jiji@deploy1003: helmfile [staging-codfw] START helmfile.d/admin 'apply'. * 15:18 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:18 moritzm: installing gcc-12 bugfix updates from Bookworm point releases (includes various run time libraries) * 15:17 andrew@cumin1002: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:16 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance * 15:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76966 and previous config saved to /var/cache/conftool/dbconfig/20250603-151552-fceratto.json * 15:10 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 15:10 
andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bullseye * 15:06 hashar: Restarted Gerrit due to issue with replication config {{!}} [[phab:T395887|T395887]] * 15:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76964 and previous config saved to /var/cache/conftool/dbconfig/20250603-150045-fceratto.json * 14:58 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:58 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add prometheus7002 - jmm@cumin1003" * 14:50 fnegri@cumin1002: conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 * 14:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host prometheus7002.magru.wmnet with OS bookworm * 14:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P76963 and previous config saved to /var/cache/conftool/dbconfig/20250603-144538-fceratto.json * 14:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76962 and previous config saved to /var/cache/conftool/dbconfig/20250603-143031-fceratto.json * 14:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:26 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: 
elastic1063.eqiad.wmnet * 14:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on prometheus7002.magru.wmnet with reason: host reimage * 14:23 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2195 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76961 and previous config saved to /var/cache/conftool/dbconfig/20250603-142314-fceratto.json * 14:23 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2195.codfw.wmnet with reason: Maintenance * 14:22 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76960 and previous config saved to /var/cache/conftool/dbconfig/20250603-142248-fceratto.json * 14:19 jmm@cumin1003: DONE (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: elastic1103.eqiad.wmnet * 14:07 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76959 and previous config saved to /var/cache/conftool/dbconfig/20250603-140740-fceratto.json * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351820|T351820]]) * 14:01 Amir1: dropping term store tables from s8 ([[phab:T351802|T351802]]) * 13:57 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 13:52 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P76957 and previous config saved to /var/cache/conftool/dbconfig/20250603-135233-fceratto.json * 13:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76956 and previous config saved to /var/cache/conftool/dbconfig/20250603-134935-root.json * 13:44 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 13:42 dbrant@deploy1003: helmfile [codfw] 
DONE helmfile.d/services/wikifeeds: apply * 13:40 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 13:38 dbrant@deploy1003: helmfile [codfw] START helmfile.d/services/wikifeeds: apply * 13:37 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76954 and previous config saved to /var/cache/conftool/dbconfig/20250603-133725-fceratto.json * 13:34 dbrant@deploy1003: helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply * 13:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76953 and previous config saved to /var/cache/conftool/dbconfig/20250603-133429-root.json * 13:33 dbrant@deploy1003: helmfile [eqiad] START helmfile.d/services/wikifeeds: apply * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:32 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 13:31 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . * 13:28 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2181 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76952 and previous config saved to /var/cache/conftool/dbconfig/20250603-132802-fceratto.json * 13:27 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance * 13:27 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 13:27 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76951 and previous config saved to /var/cache/conftool/dbconfig/20250603-132735-fceratto.json * 13:26 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:22 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 13:21 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2050.codfw.wmnet * 13:20 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 13:20 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76950 and previous config saved to /var/cache/conftool/dbconfig/20250603-131959-root.json * 13:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76949 and previous config saved to /var/cache/conftool/dbconfig/20250603-131923-root.json * 13:18 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 13:17 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
* 13:16 moritzm: installing libavif security updates * 13:16 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:14 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2050.codfw.wmnet * 13:14 jgleeson: payments-wiki rolled back from {{Gerrit|def6c267}} to {{Gerrit|1a4ef678}} * 13:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . * 13:12 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 13:12 dbrant@deploy1003: helmfile [staging] DONE helmfile.d/services/wikifeeds: apply * 13:12 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76948 and previous config saved to /var/cache/conftool/dbconfig/20250603-131228-fceratto.json * 13:11 dbrant@deploy1003: helmfile [staging] START helmfile.d/services/wikifeeds: apply * 13:11 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . * 13:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
* 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76946 and previous config saved to /var/cache/conftool/dbconfig/20250603-130453-root.json * 13:04 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76945 and previous config saved to /var/cache/conftool/dbconfig/20250603-130418-root.json * 13:04 marostegui: Shutdown clouddb1016:x3 [[phab:T390954|T390954]] * 13:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 12:58 moritzm: uploaded wmf-laptop 1.0.2 to apt.wikimedia.org * 12:57 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167', diff saved to https://phabricator.wikimedia.org/P76943 and previous config saved to /var/cache/conftool/dbconfig/20250603-125721-fceratto.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76942 and previous config saved to /var/cache/conftool/dbconfig/20250603-124948-root.json * 12:49 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76941 and previous config saved to /var/cache/conftool/dbconfig/20250603-124913-root.json * 12:42 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76940 and previous config saved to /var/cache/conftool/dbconfig/20250603-124214-fceratto.json * 12:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2049.codfw.wmnet * 12:35 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2049.codfw.wmnet * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 
'es2038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76939 and previous config saved to /var/cache/conftool/dbconfig/20250603-123442-root.json * 12:34 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76938 and previous config saved to /var/cache/conftool/dbconfig/20250603-123407-root.json * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2167 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76937 and previous config saved to /var/cache/conftool/dbconfig/20250603-123357-fceratto.json * 12:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance * 12:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76936 and previous config saved to /var/cache/conftool/dbconfig/20250603-123331-fceratto.json * 12:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 12:25 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2048 gradually with 4 steps - Pool es2048.codfw.wmnet in after cloning * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76935 and previous config saved to /var/cache/conftool/dbconfig/20250603-121937-root.json * 12:19 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76934 and previous config saved to /var/cache/conftool/dbconfig/20250603-121902-root.json * 12:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76933 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-121824-fceratto.json * 12:16 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] (duration: 09m 47s) * 12:15 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:12 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 12:10 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:09 marostegui@deploy1003: marostegui: Continuing with sync * 12:09 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 12:08 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'readability' for release 'main' . * 12:07 claime: Launching manual run of recount-categories cronjob - [[phab:T395745|T395745]] * 12:06 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153135{{!}}Revert "db-production.php: Disable writes on es7"]] * 12:06 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 12:05 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 12:04 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . 
* 12:04 marostegui@cumin1002: dbctl commit (dc=all): 'es2038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76931 and previous config saved to /var/cache/conftool/dbconfig/20250603-120431-root.json * 12:03 marostegui@cumin1002: dbctl commit (dc=all): 'es1039 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76930 and previous config saved to /var/cache/conftool/dbconfig/20250603-120356-root.json * 12:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P76929 and previous config saved to /var/cache/conftool/dbconfig/20250603-120316-fceratto.json * 12:02 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'logo-detection' for release 'main' . * 12:00 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:59 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . * 11:56 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'edit-check' for release 'main' . * 11:54 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:52 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . * 11:50 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2038 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76927 and previous config saved to /var/cache/conftool/dbconfig/20250603-115026-marostegui.json * 11:49 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . 
* 11:49 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2039 to es7 primary and set section read-write [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76926 and previous config saved to /var/cache/conftool/dbconfig/20250603-114917-marostegui.json * 11:48 marostegui: Starting es7 codfw failover from es2038 to es2039 - [[phab:T395785|T395785]] * 11:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76925 and previous config saved to /var/cache/conftool/dbconfig/20250603-114809-fceratto.json * 11:46 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2039 with weight 0 [[phab:T395785|T395785]]', diff saved to https://phabricator.wikimedia.org/P76924 and previous config saved to /var/cache/conftool/dbconfig/20250603-114637-marostegui.json * 11:46 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:46 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' . * 11:41 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2048.codfw.wmnet * 11:40 bwojtowicz@deploy1003: helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
* 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2166 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76923 and previous config saved to /var/cache/conftool/dbconfig/20250603-113952-fceratto.json * 11:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76922 and previous config saved to /var/cache/conftool/dbconfig/20250603-113946-marostegui.json * 11:39 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance * 11:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76921 and previous config saved to /var/cache/conftool/dbconfig/20250603-113924-fceratto.json * 11:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] (duration: 09m 56s) * 11:36 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2048.codfw.wmnet * 11:33 bwojtowicz@deploy1003: helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:32 marostegui@deploy1003: marostegui: Continuing with sync * 11:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 11:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1153130{{!}}db-production.php: Disable writes on es7 (T395647)]] * 11:25 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 11:22 taavi@cumin1002: conftool action : set/weight=100:pooled=yes; selector: name=clouddb1016.eqiad.wmnet,service=x3 * 11:21 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es7 [[phab:T395785|T395785]] * 11:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76919 and previous config saved to /var/cache/conftool/dbconfig/20250603-111822-fceratto.json * 11:09 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 11:03 jgleeson: payments-wiki upgraded from {{Gerrit|1a4ef678}} to {{Gerrit|def6c267}} * 11:03 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P76917 and previous config saved to /var/cache/conftool/dbconfig/20250603-110315-fceratto.json * 10:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 10:57 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2047.codfw.wmnet * 10:52 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2047.codfw.wmnet * 10:49 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2046.codfw.wmnet * 10:48 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76915 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-104809-fceratto.json * 10:44 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:44 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2046.codfw.wmnet * 10:43 fnegri@cumin1002: END (FAIL) - Cookbook sre.wikireplicas.update-views (exit_code=99) * 10:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2045.codfw.wmnet * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2165 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76914 and previous config saved to /var/cache/conftool/dbconfig/20250603-104056-fceratto.json * 10:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance * 10:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76913 and previous config saved to /var/cache/conftool/dbconfig/20250603-104030-fceratto.json * 10:40 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti2045.codfw.wmnet * 10:38 cmooney@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 11 hosts with reason: silence alerts due to down BGP groups on cr2-codfw while PIC is reconfigured * 10:25 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76912 and previous config saved to /var/cache/conftool/dbconfig/20250603-102523-fceratto.json * 10:25 ladsgroup@cumin1002: dbctl commit (dc=all): 'Fix the repool dbctl commit', diff saved to https://phabricator.wikimedia.org/P76911 and previous config saved to /var/cache/conftool/dbconfig/20250603-102517-ladsgroup.json * 10:18 fnegri@cumin1002: START - Cookbook sre.wikireplicas.update-views * 10:13 topranks: drain cr2-codfw traffic to enable PIC port bw rebalance on slot 0 
[[phab:T387504|T387504]] * 09:59 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host prometheus7002.magru.wmnet with OS bookworm * 09:43 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 09:25 jelto@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 jelto@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab-runner1002.eqiad.wmnet with reason: host reimage * 09:22 elukey: puppet cert destroy <nowiki>{</nowiki>mobileapps,proton,recommendation-api<nowiki>}</nowiki>.discovery.wmnet on puppetmaster1001 - old certs not used anymore * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:18 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2040 gradually with 4 steps - Pool es2040.codfw.wmnet in after cloning * 09:18 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:16 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:15 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P76909 and previous config saved to /var/cache/conftool/dbconfig/20250603-091521-fceratto.json * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 09:10 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2008.codfw.wmnet with OS bullseye * 09:06 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 09:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76907 and previous config saved to /var/cache/conftool/dbconfig/20250603-090013-fceratto.json * 08:59 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2164 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76906 and previous config saved to /var/cache/conftool/dbconfig/20250603-085148-fceratto.json * 08:51 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance * 08:51 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76905 and previous config saved to /var/cache/conftool/dbconfig/20250603-085121-fceratto.json * 08:45 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . * 08:43 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
* 08:42 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:41 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . * 08:40 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . * 08:38 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 08:37 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . * 08:36 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76904 and previous config saved to /var/cache/conftool/dbconfig/20250603-083614-fceratto.json * 08:36 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
* 08:34 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 08:30 jelto@cumin1002: START - Cookbook sre.hosts.reimage for host gitlab-runner1002.eqiad.wmnet with OS bookworm * 08:22 moritzm: rearm keyholder on cumin1003 following reboot * 08:21 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P76903 and previous config saved to /var/cache/conftool/dbconfig/20250603-082107-fceratto.json * 08:16 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 08:14 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin1003.eqiad.wmnet * 08:10 jmm@cumin2002: START - Cookbook sre.hosts.reboot-single for host cumin1003.eqiad.wmnet * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2009.codfw.wmnet with OS bullseye * 08:10 mvernon@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 mvernon@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - mvernon@cumin1002" * 08:06 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76901 and previous config saved to /var/cache/conftool/dbconfig/20250603-080600-fceratto.json * 08:04 phuedx: Disabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:57 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2163 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76900 and previous config saved to /var/cache/conftool/dbconfig/20250603-075638-fceratto.json * 07:56 
fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 07:56 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76899 and previous config saved to /var/cache/conftool/dbconfig/20250603-075622-fceratto.json * 07:56 jmm@cumin1003: END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) * 07:49 mvernon@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:49 jmm@cumin1003: END (PASS) - Cookbook sre.misc-clusters.restart-reboot-config-master (exit_code=0) rolling reboot on A:config-master * 07:46 jmm@cumin1003: START - Cookbook sre.dns.netbox * 07:46 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 07:46 mvernon@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 07:44 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. on all recursors * 07:44 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host bast7002.wikimedia.org * 07:43 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast7002.wikimedia.org with OS bookworm * 07:41 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76898 and previous config saved to /var/cache/conftool/dbconfig/20250603-074113-fceratto.json * 07:38 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) config-master.discovery.wmnet. 
on all recursors * 07:38 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache config-master.discovery.wmnet. on all recursors * 07:37 jmm@cumin1003: START - Cookbook sre.misc-clusters.restart-reboot-config-master rolling reboot on A:config-master * 07:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:26 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P76897 and previous config saved to /var/cache/conftool/dbconfig/20250603-072604-fceratto.json * 07:25 tchanders@deploy1003: Finished scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] (duration: 10m 39s) * 07:23 mvernon@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 07:23 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on bast7002.wikimedia.org with reason: host reimage * 07:18 tchanders@deploy1003: tchanders: Continuing with sync * 07:16 tchanders@deploy1003: tchanders: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 07:14 tchanders@deploy1003: Started scap sync-world: Backport for [[gerrit:1142649{{!}}Assign IP auto-reveal rights to certain groups (T386492)]] * 07:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76896 and previous config saved to /var/cache/conftool/dbconfig/20250603-071057-fceratto.json * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:06 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
* 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:02 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 07:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76895 and previous config saved to /var/cache/conftool/dbconfig/20250603-070155-root.json * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2154 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76894 and previous config saved to /var/cache/conftool/dbconfig/20250603-070036-fceratto.json * 07:00 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance * 07:00 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76893 and previous config saved to /var/cache/conftool/dbconfig/20250603-070021-fceratto.json * 06:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host bast7002.wikimedia.org with OS bookworm * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM bast7002.wikimedia.org - jmm@cumin1003" * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache bast7002.wikimedia.org on all recursors * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 06:54 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:53 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast7002.wikimedia.org - jmm@cumin1003" * 06:48 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host thanos-be2009.codfw.wmnet with OS bullseye * 06:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76892 and previous config saved to /var/cache/conftool/dbconfig/20250603-064649-root.json * 06:45 jmm@cumin1003: START - Cookbook sre.dns.netbox * 06:45 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host bast7002.wikimedia.org * 06:45 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76891 and previous config saved to /var/cache/conftool/dbconfig/20250603-064513-fceratto.json * 06:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76890 and previous config saved to /var/cache/conftool/dbconfig/20250603-064147-root.json * 06:37 marostegui: Decrease buffer size on clouddb1016:s8 [[phab:T390954|T390954]] * 06:31 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76889 and previous config saved to /var/cache/conftool/dbconfig/20250603-063144-root.json * 06:30 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1154.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:30 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P76887 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-063004-fceratto.json * 06:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb1016.eqiad.wmnet with reason: Setting up x3 [[phab:T390954|T390954]] * 06:26 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76886 and previous config saved to /var/cache/conftool/dbconfig/20250603-062641-root.json * 06:16 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76885 and previous config saved to /var/cache/conftool/dbconfig/20250603-061638-root.json * 06:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76884 and previous config saved to /var/cache/conftool/dbconfig/20250603-061457-fceratto.json * 06:11 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76883 and previous config saved to /var/cache/conftool/dbconfig/20250603-061134-root.json * 06:07 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db2152 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76882 and previous config saved to /var/cache/conftool/dbconfig/20250603-060719-fceratto.json * 06:07 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance * 06:01 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76881 and previous config saved to /var/cache/conftool/dbconfig/20250603-060132-root.json * 05:56 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76880 and previous config saved to 
/var/cache/conftool/dbconfig/20250603-055628-root.json * 05:46 marostegui@cumin1002: dbctl commit (dc=all): 'es1037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76879 and previous config saved to /var/cache/conftool/dbconfig/20250603-054626-root.json * 05:41 marostegui@cumin1002: dbctl commit (dc=all): 'es2037 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76878 and previous config saved to /var/cache/conftool/dbconfig/20250603-054123-root.json * 05:39 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] (duration: 09m 52s) * 05:32 marostegui@deploy1003: marostegui: Continuing with sync * 05:31 marostegui@cumin1002: dbctl commit (dc=all): 'Give some weight to es1038', diff saved to https://phabricator.wikimedia.org/P76877 and previous config saved to /var/cache/conftool/dbconfig/20250603-053151-marostegui.json * 05:31 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:29 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152896{{!}}Revert "db-production.php: Disable writes on es6"]] * 05:27 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1037 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76876 and previous config saved to /var/cache/conftool/dbconfig/20250603-052719-marostegui.json * 05:27 marostegui@dns1006: END - running authdns-update * 05:26 marostegui@dns1006: START - running authdns-update * 05:26 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es1038 to es6 primary [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76875 and previous config saved to /var/cache/conftool/dbconfig/20250603-052614-marostegui.json * 05:25 marostegui: Starting es6 eqiad failover from es1037 to es1038 - [[phab:T395867|T395867]] * 05:23 marostegui@cumin1002: dbctl commit (dc=all): 'Set es1038 with weight 0 [[phab:T395867|T395867]]', diff saved to https://phabricator.wikimedia.org/P76874 and previous config saved to /var/cache/conftool/dbconfig/20250603-052353-marostegui.json * 05:22 marostegui@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] (duration: 13m 39s) * 05:15 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:14 marostegui@deploy1003: marostegui: Continuing with sync * 05:13 marostegui@deploy1003: marostegui: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 05:09 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395867|T395867]] * 05:09 marostegui@deploy1003: Started scap sync-world: Backport for [[gerrit:1152893{{!}}db-production.php: Disable writes on es6 (T395867)]] * 04:54 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2037.codfw.wmnet with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2037 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76873 and previous config saved to /var/cache/conftool/dbconfig/20250603-045251-marostegui.json * 04:52 marostegui@cumin1002: dbctl commit (dc=all): 'Promote es2035 to es6 primary and set section read-write [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76872 and previous config saved to /var/cache/conftool/dbconfig/20250603-045202-root.json * 04:51 marostegui: Starting es6 codfw failover from es2037 to es2035 - [[phab:T395420|T395420]] * 04:49 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 7 hosts with reason: Primary switchover es6 [[phab:T395420|T395420]] * 04:48 marostegui@cumin1002: dbctl commit (dc=all): 'Set es2035 with weight 0 [[phab:T395420|T395420]]', diff saved to https://phabricator.wikimedia.org/P76871 and previous config saved to /var/cache/conftool/dbconfig/20250603-044855-root.json * 04:45 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2048 to dbctl depooled [[phab:T395771|T395771]]+', diff saved to https://phabricator.wikimedia.org/P76870 and previous config saved to /var/cache/conftool/dbconfig/20250603-044550-marostegui.json * 04:40 marostegui@cumin1002: START - Cookbook sre.mysql.clone of es2040.codfw.wmnet onto es2048.codfw.wmnet * 04:31 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2040 [[phab:T395771|T395771]]', diff saved to 
https://phabricator.wikimedia.org/P76869 and previous config saved to /var/cache/conftool/dbconfig/20250603-043151-marostegui.json * 04:29 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2040.codfw.wmnet with reason: Maintenance * 04:01 mwpresync@deploy1003: Pruned MediaWiki: 1.45.0-wmf.1 (duration: 01m 39s) * 03:48 mwpresync@deploy1003: Finished scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] (duration: 45m 55s) * 03:03 mwpresync@deploy1003: Started scap sync-world: testwikis to 1.45.0-wmf.4 refs [[phab:T392174|T392174]] * 00:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 00:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage

== 2025-06-02 ==
* 23:53 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2008.codfw.wmnet with reason: host reimage * 23:30 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host thanos-be2008.codfw.wmnet with OS bullseye * 23:25 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 23:24 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:24 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0)
for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 23:22 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:22 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 23:10 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 23:09 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" * 22:54 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:50 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage * 22:47 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:45 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage * 22:38 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:35 jclark@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage * 22:29 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bullseye * 22:16 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudcephosd1048.eqiad.wmnet with OS bullseye * 22:08 maryum: scap sync-world finished to deploy several security bugs and PrivateSettings.php changes * 22:01 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bullseye * 21:53 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 21:38 tgr@deploy1003: Unlocked for deployment [MediaWiki]: [[phab:T395758|T395758]] (duration: 22m 32s) * 21:38 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:34 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 21:22 bking@cumin2002: END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:22 bking@cumin2002: START - Cookbook sre.elasticsearch.ban Banning hosts: cirrussearch205*,cirrussearch2060* for [[phab:T395855|T395855]] - bking@cumin2002 * 21:16 bking@cumin2002: conftool action : set/pooled=no; selector: name=cirrussearch2055.codfw.wmnet{{!}}cirrussearch2056.codfw.wmnet{{!}}cirrussearch2057.codfw.wmnet{{!}}cirrussearch2058.codfw.wmnet{{!}}cirrussearch2059.codfw.wmnet{{!}}cirrussearch2060.codfw.wmnet{{!}}cirrussearch2091.codfw.wmnet * 21:16 tgr@deploy1003: Locking from deployment [MediaWiki]: [[phab:T395758|T395758]] * 21:13 bking@cumin2002: conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch.*.codfw.wmnet * 21:11 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 21:06 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] (duration: 11m 41s) * 21:05 andrew@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1049.eqiad.wmnet 
with OS bullseye * 21:04 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bullseye * 20:59 cjming@deploy1003: cjming, ksarabia: Continuing with sync * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 20:59 andrew@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:56 cjming@deploy1003: cjming, ksarabia: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:56 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 20:55 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152801{{!}}Simple summaries survey for English (T389393)]] * 20:51 jsn@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] (duration: 12m 55s) * 20:45 jsn@deploy1003: jsn: Continuing with sync * 20:41 jsn@deploy1003: jsn: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 20:38 jsn@deploy1003: Started scap sync-world: Backport for [[gerrit:1152797{{!}}Undeploy first set of Patroller Tools surveys (T389401)]] * 20:36 arlolra@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] (duration: 10m 37s) * 20:29 arlolra@deploy1003: arlolra: Continuing with sync * 20:27 kamila@cumin1002: END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-worker1028 * 20:27 arlolra@deploy1003: arlolra: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:27 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-in1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:26 kamila@cumin1002: START - Cookbook sre.network.configure-switch-interfaces for host wikikube-worker1028 * 20:25 arlolra@deploy1003: Started scap sync-world: Backport for [[gerrit:1152165{{!}}Remove wgParserEnableLegacyHeadingDOM option (T371756)]] * 20:23 cjming@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] (duration: 15m 51s) * 20:22 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on mx-out1001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:18 andrew@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - andrew@cumin1002" * 20:16 cjming@deploy1003: cjming, phuedx: Continuing with sync * 20:16 jhathaway@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 
mx-out2001.wikimedia.org with reason: [[phab:T395240|T395240]] * 20:10 cjming@deploy1003: cjming, phuedx: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 20:07 cjming@deploy1003: Started scap sync-world: Backport for [[gerrit:1152779{{!}}ext.xLab: Send limited copies of stream configs (T391988)]] * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:40 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 19:36 andrew@cumin1002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:32 andrew@cumin1002: START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1048.eqiad.wmnet with reason: host reimage * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:20 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:19 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3008.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) 
restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:15 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:14 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3009.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:08 andrew@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:06 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin pooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.admin depooling P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 19:05 sukhe@cumin1002: START - Cookbook sre.loadbalancer.upgrade restarting P<nowiki>{</nowiki>lvs3010.esams.wmnet<nowiki>}</nowiki> and A:liberica * 18:46 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) 
for hosts wikikube-worker1028.eqiad.wmnet * 18:46 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:44 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:34 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker1028.eqiad.wmnet * 18:33 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp1104.eqiad.wmnet,service=(cdn{{!}}ats-be) * 18:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76866 and previous config saved to /var/cache/conftool/dbconfig/20250602-183230-fceratto.json * 18:24 brett: include libvmod-wmfuniq 0.2.0~deb11u1 in bullseye-wikimedia * 18:23 brett: include libvmod-wmfuniq 0.2.0~deb12u1 in bookworm-wikimedia * 18:21 ebernhardson@deploy1003: Finished deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 (duration: 00m 29s) * 18:21 ebernhardson@deploy1003: Started deploy [airflow-dags/search@443d0ab]: bump glent to 0.3.6 * 18:17 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76865 and previous config saved to /var/cache/conftool/dbconfig/20250602-181722-fceratto.json * 18:10 jasmine@cumin1002: END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts wikikube-worker[1026-1028].eqiad.wmnet * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 18:10 jasmine@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:07 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 18:06 jclark@cumin1002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1048.eqiad.wmnet 
with OS bullseye * 18:05 jasmine@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-worker[1026-1028].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jasmine@cumin1002" * 18:02 jasmine@cumin1002: START - Cookbook sre.dns.netbox * 18:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254', diff saved to https://phabricator.wikimedia.org/P76864 and previous config saved to /var/cache/conftool/dbconfig/20250602-180216-fceratto.json * 17:50 jasmine@cumin1002: START - Cookbook sre.hosts.decommission for hosts wikikube-worker[1026-1028].eqiad.wmnet * 17:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76863 and previous config saved to /var/cache/conftool/dbconfig/20250602-174708-fceratto.json * 17:38 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1254 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76862 and previous config saved to /var/cache/conftool/dbconfig/20250602-173850-fceratto.json * 17:38 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1254.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance * 17:33 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76861 and previous config saved to /var/cache/conftool/dbconfig/20250602-173316-fceratto.json * 17:22 jclark@cumin1002: START - Cookbook sre.hosts.reimage for host cloudcephosd1048.eqiad.wmnet with OS bullseye * 17:18 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to Unable to send diff to phaste and 
previous config saved to /var/cache/conftool/dbconfig/20250602-171804-fceratto.json * 17:05 jasmine@cumin1002: END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:04 jasmine@cumin1002: START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker[1026-1028].eqiad.wmnet * 17:02 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P76860 and previous config saved to /var/cache/conftool/dbconfig/20250602-170256-fceratto.json * 16:50 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:49 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 16:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76859 and previous config saved to /var/cache/conftool/dbconfig/20250602-164748-fceratto.json * 16:44 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:43 jhancock@cumin2002: START - Cookbook sre.hosts.provision for host cloudcontrol2010-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1233 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76857 and previous config saved to /var/cache/conftool/dbconfig/20250602-164030-fceratto.json * 16:40 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1233.eqiad.wmnet with reason: Maintenance * 16:40 jhancock@cumin2002: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol2010-dev.codfw.wmnet with OS 
bookworm * 16:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76856 and previous config saved to /var/cache/conftool/dbconfig/20250602-164003-fceratto.json * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:36 fceratto@deploy1003: helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . * 16:30 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76855 and previous config saved to /var/cache/conftool/dbconfig/20250602-162957-root.json * 16:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76854 and previous config saved to /var/cache/conftool/dbconfig/20250602-162455-fceratto.json * 16:22 sukhe: sudo cumin -b1 -s60 'A:cp and not P<nowiki>{</nowiki>cp7001*<nowiki>}</nowiki>' "depool cdn && sleep 10 && run-puppet-agent --enable 'merging CR 1091330' && systemctl restart trafficserver.service && sleep 10 && pool cdn" * 16:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76853 and previous config saved to /var/cache/conftool/dbconfig/20250602-161452-root.json * 16:09 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P76852 and previous config saved to /var/cache/conftool/dbconfig/20250602-160948-fceratto.json * 16:03 sukhe@puppetserver1001: conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: [end] testing CR {{Gerrit|1091330}}] * 15:59 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76851 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-155946-root.json * 15:55 sukhe: enable puppet and run agent on cp7001 * 15:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76850 and previous config saved to /var/cache/conftool/dbconfig/20250602-155441-fceratto.json * 15:52 sukhe@puppetserver1001: conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [reason: testing CR {{Gerrit|1091330}}] * 15:50 sukhe: disable puppet on A:cp to merge CR: {{Gerrit|1091330}} * 15:49 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] (duration: 14m 23s) * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1229 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76849 and previous config saved to /var/cache/conftool/dbconfig/20250602-154734-fceratto.json * 15:47 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1229.eqiad.wmnet with reason: Maintenance * 15:47 cjming@deploy1003: helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply * 15:47 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76848 and previous config saved to /var/cache/conftool/dbconfig/20250602-154709-fceratto.json * 15:46 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:44 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76847 and previous config saved to /var/cache/conftool/dbconfig/20250602-154440-root.json * 15:42 jhancock@cumin2002: 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2009.codfw.wmnet with reason: host reimage * 15:42 phuedx@deploy1003: phuedx: Continuing with sync * 15:38 phuedx@deploy1003: phuedx: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. * 15:35 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152757{{!}}Enable MetricsPlatform's experimentation feature]] * 15:32 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76846 and previous config saved to /var/cache/conftool/dbconfig/20250602-153201-fceratto.json * 15:29 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76845 and previous config saved to /var/cache/conftool/dbconfig/20250602-152935-root.json * 15:27 joal@deploy1003: Finished deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) [airflow-dags/analytics_test@03db0552] (duration: 00m 42s) * 15:26 joal@deploy1003: Started deploy [airflow-dags/analytics@03db055]: Regular analytics weekly train (with pull...) 
[airflow-dags/analytics_test@03db0552] * 15:21 thcipriani: jouncebot nowandnext * 15:19 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host cloudcontrol2010-dev.codfw.wmnet with OS bookworm * 15:16 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P76844 and previous config saved to /var/cache/conftool/dbconfig/20250602-151654-fceratto.json * 15:15 jhancock@cumin2002: START - Cookbook sre.hosts.reimage for host thanos-be2009.codfw.wmnet with OS bullseye * 15:14 marostegui@cumin1002: dbctl commit (dc=all): 'db1211 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76843 and previous config saved to /var/cache/conftool/dbconfig/20250602-151429-root.json * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] (duration: 00m 05s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics_test@4ebb376]: Regular analytics weekly train [airflow-dags/analytics_test@4ebb376f] * 15:03 joal@deploy1003: Finished deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] (duration: 00m 07s) * 15:03 joal@deploy1003: Started deploy [airflow-dags/analytics@afad011]: Regular analytics weekly train [airflow-dags/main@afad011c] * 15:01 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76842 and previous config saved to /var/cache/conftool/dbconfig/20250602-150146-fceratto.json * 15:00 phuedx: Disabled the SDS 2.4.11 Synthetic A/A Test in xLab * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1222 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76841 and previous config saved to /var/cache/conftool/dbconfig/20250602-145443-fceratto.json * 14:54 fceratto@cumin1002: DONE (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance * 14:54 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76840 and previous config saved to /var/cache/conftool/dbconfig/20250602-145418-fceratto.json * 14:54 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] (duration: 09m 27s) * 14:44 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (hadoop-test): Regular analytics weekly train test [analytics/refinery@b1aa837f] * 14:44 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] (duration: 01m 06s) * 14:43 joal@deploy1003: Started deploy [analytics/refinery@b1aa837] (thin): Regular analytics weekly train THIN [analytics/refinery@b1aa837f] * 14:42 joal@deploy1003: Finished deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] (duration: 03m 08s) * 14:39 joal@deploy1003: Started deploy [analytics/refinery@b1aa837]: Regular analytics weekly train [analytics/refinery@b1aa837f] * 14:39 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76835 and previous config saved to /var/cache/conftool/dbconfig/20250602-143910-fceratto.json * 14:36 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[1154,1211].eqiad.wmnet with reason: Maintenance * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1067.eqiad.wmnet * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 14:35 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:35 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 14:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P76833 and previous config saved to /var/cache/conftool/dbconfig/20250602-142403-fceratto.json * 14:08 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76832 and previous config saved to /var/cache/conftool/dbconfig/20250602-140854-fceratto.json * 14:04 phuedx: Enabling the SDS 2.4.11 Synthetic A/A Test in xLab * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1197 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76831 and previous config saved to /var/cache/conftool/dbconfig/20250602-135945-fceratto.json * 13:59 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance * 13:59 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76830 and previous config saved to /var/cache/conftool/dbconfig/20250602-135920-fceratto.json * 13:52 jhancock@cumin2002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host prometheus7002.magru.wmnet * 13:49 jmm@cumin1003: END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host 
prometheus7002.magru.wmnet with OS bookworm * 13:44 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76829 and previous config saved to /var/cache/conftool/dbconfig/20250602-134413-fceratto.json * 13:43 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:41 jclark@cumin1002: END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:38 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts elastic1067.eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:37 bking@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:37 bking@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cirrussearch[1064-1066].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002" * 13:34 jhancock@cumin2002: 
START - Cookbook sre.hosts.provision for host thanos-be2009.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:29 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P76828 and previous config saved to /var/cache/conftool/dbconfig/20250602-132906-fceratto.json * 13:24 Lucas_WMDE: UTC afternoon backport+config window done * 13:22 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance * 13:22 lucaswerkmeister-wmde@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] (duration: 12m 00s) * 13:21 bking@cumin2002: START - Cookbook sre.dns.netbox * 13:20 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1051.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1048.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1050.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 jclark@cumin1002: START - Cookbook sre.hosts.provision for host cloudcephosd1049.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED * 13:19 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
* 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 13:17 jclark@cumin1002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:16 jclark@cumin1002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for cloudcephosd1048-51 - jclark@cumin1002" * 13:15 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Continuing with sync * 13:14 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76826 and previous config saved to /var/cache/conftool/dbconfig/20250602-131359-fceratto.json * 13:13 lucaswerkmeister-wmde@deploy1003: bunnypranav, lucaswerkmeister-wmde: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 13:13 jclark@cumin1002: START - Cookbook sre.dns.netbox * 13:10 lucaswerkmeister-wmde@deploy1003: Started scap sync-world: Backport for [[gerrit:1152191{{!}}core-Namespaces: Add Page, Author to default search ns in ruwikisource (T395632)]] * 13:09 bking@cumin2002: START - Cookbook sre.hosts.decommission for hosts cirrussearch[1064-1066].eqiad.wmnet * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1188 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76825 and previous config saved to /var/cache/conftool/dbconfig/20250602-130548-fceratto.json * 13:05 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance * 13:05 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76824 and previous config saved to /var/cache/conftool/dbconfig/20250602-130523-fceratto.json * 12:50 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76823 and previous config saved to /var/cache/conftool/dbconfig/20250602-125016-fceratto.json * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of es2036.codfw.wmnet onto es2047.codfw.wmnet * 12:37 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 12:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2003.wikimedia.org * 12:35 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P76821 and previous config saved to /var/cache/conftool/dbconfig/20250602-123508-fceratto.json * 12:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host irc2003.wikimedia.org * 12:26 jmm@cumin1003: START - Cookbook 
sre.hosts.reimage for host prometheus7002.magru.wmnet with OS bookworm * 12:25 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:25 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache prometheus7002.magru.wmnet on all recursors * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 12:24 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:22 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM prometheus7002.magru.wmnet - jmm@cumin1003" * 12:20 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76819 and previous config saved to /var/cache/conftool/dbconfig/20250602-122001-fceratto.json * 12:17 jmm@cumin1003: START - Cookbook sre.dns.netbox * 12:17 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host prometheus7002.magru.wmnet * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh7003.wikimedia.org * 12:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh7003.wikimedia.org with OS bookworm * 12:11 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 12:11 
jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet * 12:10 mvernon@cumin1002: END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render * 12:10 mvernon@cumin1002: START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1182 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76818 and previous config saved to /var/cache/conftool/dbconfig/20250602-121041-fceratto.json * 12:10 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance * 12:10 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76817 and previous config saved to /var/cache/conftool/dbconfig/20250602-121016-fceratto.json * 12:07 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet * 12:00 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2002.codfw.wmnet * 11:59 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . * 11:57 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . * 11:55 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
* 11:55 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76814 and previous config saved to /var/cache/conftool/dbconfig/20250602-115509-fceratto.json * 11:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host build2002.codfw.wmnet * 11:51 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:49 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on doh7003.wikimedia.org with reason: host reimage * 11:48 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2047 gradually with 4 steps - Pool es2047.codfw.wmnet in after cloning * 11:47 bwojtowicz@deploy1003: helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . * 11:46 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org * 11:42 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org * 11:40 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P76812 and previous config saved to /var/cache/conftool/dbconfig/20250602-114001-fceratto.json * 11:39 claime: cgoubert@mwmaint1002:~$ sudo systemctl restart mediawiki_job_generatecaptcha.service - [[phab:T388531|T388531]] * 11:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org * 11:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org * 11:32 claime: Manual run of cronjobs/generatecaptcha on k8s - [[phab:T388531|T388531]] * 11:31 cgoubert@deploy1003: helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply * 11:30 cgoubert@deploy1003: helmfile [eqiad] START helmfile.d/services/mw-cron: apply * 11:24 fceratto@cumin1002: dbctl commit (dc=all): 'Repooling after maintenance 
db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76811 and previous config saved to /var/cache/conftool/dbconfig/20250602-112453-fceratto.json * 11:23 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint1001.eqiad.wmnet * 11:19 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint1001.eqiad.wmnet * 11:18 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host doh7003.wikimedia.org with OS bookworm * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM doh7003.wikimedia.org - jmm@cumin1003" * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache doh7003.wikimedia.org on all recursors * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 11:17 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:16 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM doh7003.wikimedia.org - jmm@cumin1003" * 11:15 fceratto@cumin1002: dbctl commit (dc=all): 'Depooling db1156 ([[phab:T395241|T395241]])', diff saved to https://phabricator.wikimedia.org/P76810 and previous config saved to /var/cache/conftool/dbconfig/20250602-111519-fceratto.json * 11:15 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance * 11:14 fceratto@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance * 11:11 jmm@cumin1003: START - Cookbook sre.dns.netbox * 11:11 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host doh7003.wikimedia.org * 11:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76809 and previous config saved to /var/cache/conftool/dbconfig/20250602-111044-root.json * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host durum7003.magru.wmnet * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum7003.magru.wmnet with OS bookworm * 10:58 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-maint2001.codfw.wmnet * 10:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76808 and previous config saved to /var/cache/conftool/dbconfig/20250602-105539-root.json * 10:54 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ldap-maint2001.codfw.wmnet * 10:48 marostegui@cumin1002: END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:41 kamila@deploy1003: helmfile [aux-k8s-eqiad] DONE helmfile.d/aux-k8s-services/jaeger: apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-eqiad] START helmfile.d/aux-k8s-services/jaeger: apply * 10:40 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76806 and previous config saved to /var/cache/conftool/dbconfig/20250602-104032-root.json * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] DONE helmfile.d/aux-k8s-services/jaeger: 
apply * 10:40 kamila@deploy1003: helmfile [aux-k8s-codfw] START helmfile.d/aux-k8s-services/jaeger: apply * 10:37 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:34 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on durum7003.magru.wmnet with reason: host reimage * 10:31 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet * 10:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet * 10:25 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76804 and previous config saved to /var/cache/conftool/dbconfig/20250602-102526-root.json * 10:22 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet * 10:18 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet * 10:12 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet * 10:10 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76802 and previous config saved to /var/cache/conftool/dbconfig/20250602-101020-root.json * 10:08 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet * 10:07 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host durum7003.magru.wmnet with OS bookworm * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 
jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache durum7003.magru.wmnet on all recursors * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 10:06 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:06 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM durum7003.magru.wmnet - jmm@cumin1003" * 10:02 marostegui@cumin1002: START - Cookbook sre.mysql.pool es2036 gradually with 4 steps - Pool es2036.codfw.wmnet in after cloning * 10:02 jmm@cumin1003: START - Cookbook sre.dns.netbox * 10:02 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host durum7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ncredir7003.magru.wmnet * 09:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncredir7003.magru.wmnet with OS bookworm * 09:55 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet * 09:55 marostegui@cumin1002: dbctl commit (dc=all): 'es2039 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76800 and previous config saved to /var/cache/conftool/dbconfig/20250602-095514-root.json * 09:45 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2039.codfw.wmnet with reason: Maintenance * 09:44 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2039 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76798 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-094402-marostegui.json * 09:40 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon1003.wikimedia.org * 09:33 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org * 09:28 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:27 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org * 09:24 jmm@cumin1003: START - Cookbook sre.hosts.downtime for 2:00:00 on ncredir7003.magru.wmnet with reason: host reimage * 09:22 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org * 09:13 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet * 09:10 jelto: update gitlab-settings artifact retention to 6 month - [[phab:T395014|T395014]] * 09:09 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet * 09:02 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:58 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: START - Cookbook sre.dns.wipe-cache 
ncredir7003.magru.wmnet on all recursors * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:51 jmm@cumin1003: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 jmm@cumin1003: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" * 08:51 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76796 and previous config saved to /var/cache/conftool/dbconfig/20250602-085105-root.json * 08:47 jmm@cumin1003: START - Cookbook sre.dns.netbox * 08:47 jmm@cumin1003: START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet * 08:45 jmm@cumin1003: END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ncredir7003.magru.wmnet with OS bookworm * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts ncredir7003.magru.wmnet * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) * 08:42 jmm@cumin2002: END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:41 jmm@cumin2002: START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: ncredir7003.magru.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" * 08:37 jmm@cumin2002: START - Cookbook sre.dns.netbox * 08:36 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76795 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-083559-root.json * 08:33 jmm@cumin2002: START - Cookbook sre.hosts.decommission for hosts ncredir7003.magru.wmnet * 08:20 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76794 and previous config saved to /var/cache/conftool/dbconfig/20250602-082053-root.json * 08:11 jmm@cumin1003: START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm * 08:05 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76793 and previous config saved to /var/cache/conftool/dbconfig/20250602-080547-root.json * 07:58 phuedx@deploy1003: Finished scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] (duration: 35m 59s) * 07:50 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76792 and previous config saved to /var/cache/conftool/dbconfig/20250602-075041-root.json * 07:49 phuedx@deploy1003: phuedx, dr0ptp4kt: Continuing with sync * 07:38 phuedx@deploy1003: phuedx, dr0ptp4kt: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
* 07:36 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet * 07:35 marostegui@cumin1002: dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76791 and previous config saved to /var/cache/conftool/dbconfig/20250602-073535-root.json * 07:27 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet * 07:22 phuedx@deploy1003: Started scap sync-world: Backport for [[gerrit:1152253{{!}}Beta Cluster: Support A/B experiments (T393918)]] * 07:20 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Maintenance * 07:13 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: Maintenance * 07:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76790 and previous config saved to /var/cache/conftool/dbconfig/20250602-070837-root.json * 07:08 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org * 07:06 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es1040 [[phab:T395647|T395647]]', diff saved to https://phabricator.wikimedia.org/P76789 and previous config saved to /var/cache/conftool/dbconfig/20250602-070602-marostegui.json * 07:02 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org * 06:59 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast2003.wikimedia.org * 06:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76787 and previous config saved to 
/var/cache/conftool/dbconfig/20250602-065331-root.json * 06:53 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org * 06:52 jmm@cumin1003: END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netbox-dev2003.codfw.wmnet * 06:48 jmm@cumin1003: START - Cookbook sre.hosts.reboot-single for host netbox-dev2003.codfw.wmnet * 06:38 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76786 and previous config saved to /var/cache/conftool/dbconfig/20250602-063826-root.json * 06:23 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76785 and previous config saved to /var/cache/conftool/dbconfig/20250602-062320-root.json * 06:08 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76783 and previous config saved to /var/cache/conftool/dbconfig/20250602-060815-root.json * 05:53 marostegui@cumin1002: dbctl commit (dc=all): 'db1169 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76782 and previous config saved to /var/cache/conftool/dbconfig/20250602-055309-root.json * 05:39 marostegui@cumin1002: dbctl commit (dc=all): 'Depool db1169 [[phab:T395663|T395663]]', diff saved to https://phabricator.wikimedia.org/P76781 and previous config saved to /var/cache/conftool/dbconfig/20250602-053905-marostegui.json * 05:38 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1169.eqiad.wmnet with reason: Maintenance * 05:20 marostegui@cumin1002: dbctl commit (dc=all): 'Add es2047 to dbctl depooled [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76780 and previous config saved to /var/cache/conftool/dbconfig/20250602-051957-marostegui.json * 05:15 marostegui@cumin1002: START - Cookbook sre.mysql.clone of 
es2036.codfw.wmnet onto es2047.codfw.wmnet * 05:02 marostegui@cumin1002: DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2036.codfw.wmnet with reason: Maintenance * 05:01 marostegui@cumin1002: dbctl commit (dc=all): 'Depool es2036 [[phab:T395771|T395771]]', diff saved to https://phabricator.wikimedia.org/P76779 and previous config saved to /var/cache/conftool/dbconfig/20250602-050150-marostegui.json
== Archives ==
See [[Server Admin Log/Archives]]. <noinclude> [[Category:SAL]] [[Category:Operations]] </noinclude>
Nova Resource:Tools.heritage/SAL
=== 2025-06-08 === * 12:00 wmbot~multichill@tools-bastion-12: Switched ErfgoedBot to BotPasswords for [[phab:T395205|T395205]], unable to test === 2024-11-19 === * 17:54 wmbot~multichill@tools-bastion-12: loaded the jobs.yml, job list was empty, and started update-monuments === 2024-09-01 === * 18:56 wmbot~jeanfred@tools-sgebastion-10: Deploy {{Gerrit|b11a390}}, {{Gerrit|d178412}}, {{Gerrit|669b549}}, {{Gerrit|e1e2330}}, {{Gerrit|d042582}} ([[phab:T174633|T174633]]), {{Gerrit|e079bf5}} ([[phab:T319787|T319787]]), {{Gerrit|856f770}}, {{Gerrit|86c9e0e}}, {{Gerrit|4faee34}} ([[phab:T319787|T319787]]) === 2024-08-22 === * 05:37 wmbot~jeanfred@tools-sgebastion-10: Started harvesting again with 4GB memory === 2024-08-19 === * 13:11 wmbot~jeanfred@tools-sgebastion-10: Recreate Python virtual environment with toolforge jobs run bootstrap-venv --command "cd $PWD && ./bin/build-python.sh" --image python3.7 --wait * 13:10 wmbot~jeanfred@tools-sgebastion-10: Delete all venvs (including backups) * 11:46 wmbot~jeanfred@tools-sgebastion-10: Deploy {{Gerrit|4ed5e87}}, {{Gerrit|59fbee8}}, {{Gerrit|2a905c0}}, {{Gerrit|6e0373f}}, {{Gerrit|73f7fa6}}, {{Gerrit|172df40}}, {{Gerrit|4fd2acc}},
{{Gerrit|de95fcc}}, {{Gerrit|6559353}}, {{Gerrit|b7c6cff}}, {{Gerrit|9f15d20}}, {{Gerrit|72f96a8}}, {{Gerrit|b126e2b}}, {{Gerrit|7171470}}, {{Gerrit|a944bf6}}, {{Gerrit|70bce45}}, {{Gerrit|41c79d7}} === 2023-10-15 === * 20:27 wm-bot: <jeanfred> Deploy {{Gerrit|70616a4}} ([[phab:T345695|T345695]]), {{Gerrit|37fe60b}}, {{Gerrit|764e236}} ([[phab:T345695|T345695]]), {{Gerrit|b46cdbe}}, {{Gerrit|6caa9be}}, {{Gerrit|c81c7f6}}, {{Gerrit|60cd8c3}}, {{Gerrit|7925f7a}}, {{Gerrit|9a70390}}, {{Gerrit|0a8c490}} ([[phab:T346681|T346681]]) === 2023-08-31 === * 22:01 wm-bot: <jeanfred> Enable check_emailable_users crontab for WLM 2023 === 2023-08-26 === * 06:11 wm-bot: <jeanfred> Deploy {{Gerrit|b91e3b0}}, {{Gerrit|cfc94e1}} ([[phab:T338987|T338987]]), {{Gerrit|ad1597a}} ([[phab:T338987|T338987]]), {{Gerrit|e68e0e7}} ([[phab:T338987|T338987]]) === 2023-08-23 === * 07:13 wm-bot: <jeanfred> Removed STRICT_TRANS_TABLES from SQL_MODE (again) for [[phab:T338987|T338987]] === 2023-08-22 === * 19:25 wm-bot: <jeanfred> Deploy {{Gerrit|bc9526e}} ([[phab:T338987|T338987]]), {{Gerrit|4283cc2}} ([[phab:T338987|T338987]]), {{Gerrit|1da16b6}} ([[phab:T338987|T338987]]), {{Gerrit|8e60754}} ([[phab:T338987|T338987]]) * 19:14 wm-bot: <jeanfred> Deploy {{Gerrit|8a14669}}, {{Gerrit|9b3686d}} === 2023-08-21 === * 16:00 wm-bot: <jeanfred> Set EMPTY_STRING_IS_NULL SQL mode (using SET sql_mode = 'STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION,EMPTY_STRING_IS_NULL';) for [[phab:T338987|T338987]] * 16:00 wm-bot: <jeanfred> Set EMPTY_STRING_IS_NULL SQL mode (using ) for [[phab:T338987|T338987]] * 15:40 wm-bot: <jeanfred> Deploy {{Gerrit|ac59511}}, {{Gerrit|cb0e175}}, {{Gerrit|a022272}}, {{Gerrit|ee7a125}}, {{Gerrit|12bc606}}, {{Gerrit|fb92a1b}}, {{Gerrit|f15a563}} ([[phab:T341773|T341773]]), {{Gerrit|1a7a6d2}}, {{Gerrit|5745733}}, {{Gerrit|75d2ae8}}, {{Gerrit|51c073f}}, {{Gerrit|79da223}} === 2023-05-16 === * 12:22 wm-bot: <jeanfred> Run check_emailable_users.py 
-category:"Images_from_Wiki_Loves_Earth_2023" -delta:23040 -notify to catch up with the first 16 days of WLE 2023 === 2023-03-08 === * 12:47 wm-bot: <lokal-profil> Deploy {{Gerrit|a4d8f5e}}, {{Gerrit|ed09031}}, {{Gerrit|aef11be}}, {{Gerrit|a9a3815}}, {{Gerrit|1ac1b7d}}, {{Gerrit|2f02594}}, {{Gerrit|3a34d3a}}, {{Gerrit|88ebc6e}} ([[phab:T317278|T317278]]), {{Gerrit|c65ef5c}}, {{Gerrit|ed58572}}, {{Gerrit|dbb296e}}, {{Gerrit|c9878d0}}, {{Gerrit|af60a34}}, {{Gerrit|71378e1}} ([[phab:T327956|T327956]]), {{Gerrit|133bbf2}} ([[phab:T317279|T317279]]), {{Gerrit|91eab60}} * 12:44 wm-bot: <lokal-profil> Deploy {{Gerrit|8b52299}} * 12:35 wm-bot: <lokal-profil> Deploy {{Gerrit|8b52299}} === 2022-10-03 === * 09:55 wm-bot: <lokal-profil> Deploy {{Gerrit|8b52299}} === 2022-09-08 === * 19:20 wm-bot: <jeanfred> Run check_emailable_users.py -category:"Images_from_Wiki_Loves_Monuments_2022" -delta:13000 -notify to catch up with the first 8 days of WLM 2022 * 18:53 wm-bot: <jeanfred> Deploy {{Gerrit|ae766f9}} === 2022-09-07 === * 20:18 wm-bot: <jeanfred> Run check_emailable_users.py -category:"Images_from_Wiki_Loves_Monuments_2022" -delta:10500 -notify to catch up with the first 7 days of WLM 2022 === 2022-08-31 === * 20:11 wm-bot: <jeanfred> Deploy {{Gerrit|b1eb124}} ([[phab:T316626|T316626]]) * 20:07 wm-bot: <jeanfred> Deploy {{Gerrit|62d7547}} * 08:16 wm-bot: <lokal-profil> Deploy {{Gerrit|3bb8b8c}}, {{Gerrit|2e51c66}} ([[phab:T316645|T316645]]) === 2022-08-29 === * 20:33 wm-bot: <jeanfred> Deploy {{Gerrit|da97380}}, {{Gerrit|d5780dc}}, {{Gerrit|49ddbcb}}, {{Gerrit|327c03b}}, {{Gerrit|14c4ca9}}, {{Gerrit|74bff71}}, {{Gerrit|7d866b7}}, {{Gerrit|11fb1b0}} ([[phab:T307269|T307269]]), {{Gerrit|406909f}}, {{Gerrit|857152e}} ([[phab:T225409|T225409]]) === 2022-05-03 === * 08:18 wm-bot: <jeanfred> Trigger a full update_monuments job post-[[phab:T307269|T307269]] * 08:17 wm-bot: <jeanfred> Run ./bin/build.sh to rebuild the virtualenv with Python3.7 on a Buster node (for 
[[phab:T307269|T307269]]) * 08:16 wm-bot: <jeanfred> Run ./bin/build.sh to rebuild the virtualenv with Python3.7 on a Buster node * 08:07 wm-bot: <jeanfred> Deploy latest from Git: {{Gerrit|ef4111a}}, {{Gerrit|f70063a}}, {{Gerrit|db03f84}}, {{Gerrit|27fa2f6}} === 2022-04-27 === * 21:13 wm-bot: <jeanfred> Manual trigger of categorize_images.sh === 2022-04-18 === * 19:56 wm-bot: <lokal-profil> Adding the -release buster flag to cronjobs as part of Stretch deprecation === 2022-03-03 === * 20:32 wm-bot: <jeanfred> Deploy {{Gerrit|fcedb0f}}, {{Gerrit|10b63ae}}, {{Gerrit|a00f000}}, {{Gerrit|6d2fc8a}} === 2022-02-02 === * 12:49 wm-bot: <jeanfred> Trigger full update_monuments.sh run now that [[phab:T300252|T300252]] is fixed * 12:47 wm-bot: <jeanfred> Deploy {{Gerrit|2715e85}} ([[phab:T300252|T300252]]) * 12:30 urbanecm: Run P19935 in `s51138__heritage_p` to workaround [[phab:T300252|T300252]] for Czech * 11:42 urbanecm: Run `python erfgoedbot/update_database.py -countrycode:cz -langcode:cs` with 759229 applied to ensure fix for [[phab:T300252|T300252]] works === 2022-01-27 === * 05:45 wm-bot: <jeanfred> Deploy latest from Git master: {{Gerrit|7ed2ef3}}, {{Gerrit|7ed27df}}, {{Gerrit|d1fe905}}, {{Gerrit|da4b9a4}}, {{Gerrit|681c3c2}}, {{Gerrit|81f91ab}}, {{Gerrit|ede4d73}}, {{Gerrit|1e5e614}}, {{Gerrit|c43703e}} ([[phab:T295238|T295238]]) === 2021-09-06 === * 11:21 wm-bot: <jeanfred> Run check_emailable_users.py -category:"Images_from_Wiki_Loves_Monuments_2021" -delta:10500 -notify to catch up with the first 7 days of WLM 2021 * 11:19 wm-bot: <jeanfred> Bumped check_emailable_users to 2021 and re-enabled cron job === 2021-09-04 === * 19:50 wm-bot: <lokal-profil> Deploy latest from Git master: {{Gerrit|b4d3e0e}}, {{Gerrit|339838b}} ([[phab:T289929|T289929]]), {{Gerrit|7816a36}} ([[phab:T289930|T289930]]) === 2021-08-26 === * 21:27 wm-bot: <lokal-profil> Deploy latest from Git master: {{Gerrit|f8959e7}}, {{Gerrit|96c6f56}}, {{Gerrit|8f46ca8}}, {{Gerrit|841e8e8}}, 
{{Gerrit|7d3cfbc}}, {{Gerrit|685e4f6}}, {{Gerrit|b1a510f}}, {{Gerrit|78b1ce7}}, {{Gerrit|7172b59}}, {{Gerrit|a6d1889}}, {{Gerrit|2bfb4a0}}, {{Gerrit|a200b30}}, {{Gerrit|62ea8c3}}, {{Gerrit|2522867}}, {{Gerrit|2790623}}, {{Gerrit|b4eed97}}, {{Gerrit|3b29047}}, {{Gerrit|1f1d2da}}, {{Gerrit|27c3613}}, {{Gerrit|6003fc6}}, {{Gerrit|6bd1a0d}}, {{Gerrit|d577a84}} ([[phab:T278918|T278918]]), {{Gerrit|402b161}}, {{Gerrit|2bdd929}}, {{Gerrit|7f50cb3}} ([[phab:T286354|T286354]]), {{Gerrit|2be29f1}} === 2020-09-04 === * 21:22 wm-bot: <lokal-profil> Deploy latest from Git master: {{Gerrit|0f24d39}} === 2020-09-03 === * 15:14 wm-bot: <jeanfred> Deploy latest from Git master: {{Gerrit|f43b6aa}}, {{Gerrit|a379e25}}, {{Gerrit|b6b5e92}}, {{Gerrit|5108541}} === 2020-09-02 === * 08:00 wm-bot: <lokal-profil> Deploy latest from Git master: {{Gerrit|037df2f}}, {{Gerrit|9c73470}} === 2020-08-31 === * 22:02 wm-bot: <lokal-profil> Manually triggered update_monuments (after temporarily setting setuptools==49.3.0) * 13:55 wm-bot: <lokal-profil> Manually triggered update_monuments (again, again) * 13:53 wm-bot: <lokal-profil> Deploy latest from Git master: {{Gerrit|069df77}}, {{Gerrit|907c6fe}} ([[phab:T224405|T224405]]), {{Gerrit|21d30e0}}, {{Gerrit|25a1951}} === 2020-08-30 === * 21:10 wm-bot: <lokal-profil> Manually triggered update_monuments (again) * 21:06 wm-bot: <lokal-profil> Forced re-install of pywikibot over pip * 20:40 wm-bot: <lokal-profil> Bumped check_emailable_users to 2020 and re-enabled cron job * 20:36 wm-bot: <lokal-profil> Manually triggered update_monuments * 20:19 wm-bot: <lokal-profil> rename existing .venv and re-run build.sh * 11:32 wm-bot: <lokal-profil> Deploy latest from Git master: {{Gerrit|0955c33}} ([[phab:T224405|T224405]]) === 2020-08-28 === * 08:27 wm-bot: <lokal-profil> Deploy latest from Git master: {{Gerrit|415c004}}, {{Gerrit|f3ac064}} === 2020-08-26 === * 21:29 wm-bot: <jeanfred> jsub new categorize_images job to test {{Gerrit|7b0a9d7}} * 21:29 wm-bot: 
<jeanfred> Deploy latest from Git master: {{Gerrit|7b0a9d7}} === 2020-08-25 === * 16:34 wm-bot: <jeanfred> jstop categorize_images, jsub new one * 15:42 wm-bot: <jeanfred> Deploy latest from Git master: {{Gerrit|f0d25be}}, {{Gerrit|b7d8ad4}}, {{Gerrit|021ba20}}, {{Gerrit|774663a}}, {{Gerrit|a0fa5f5}}, {{Gerrit|7b66f3a}}, {{Gerrit|072dfd7}}, {{Gerrit|eb61a9a}}, {{Gerrit|a2f600e}}, {{Gerrit|7da8c34}}, {{Gerrit|8ddfa1f}}, {{Gerrit|8072796}}, {{Gerrit|7b46c91}}, {{Gerrit|b1ec885}}, {{Gerrit|79addba}}, {{Gerrit|997c7a7}}, {{Gerrit|12a3ec6}} === 2020-07-06 === * 22:55 wm-bot: <root> Migrated .webservicerc to service.template ([[phab:T257229|T257229]]) === 2020-02-25 === * 23:17 wm-bot: <root> Migrated to 2020 Kubernetes cluster === 2020-02-24 === * 21:04 JeanFred: Deploy latest from Git master: {{Gerrit|d2f2eab}} ([[phab:T176560|T176560]]), {{Gerrit|36b4940}}, {{Gerrit|a493161}}, {{Gerrit|2cea31c}}, {{Gerrit|f63aa99}}, {{Gerrit|09efc1c}}, {{Gerrit|7e78abb}}, {{Gerrit|2feab97}}, {{Gerrit|dd5062c}}, {{Gerrit|a3616d8}}, {{Gerrit|465e82e}}, {{Gerrit|010917d}} ([[phab:T244445|T244445]]), {{Gerrit|2bcef18}} === 2020-02-11 === * 20:37 wm-bot157: <lokal-profil> Triggered a manual categorisation job for ua_uk ([[phab:T244333|T244333]]) === 2020-02-05 === * 08:50 wm-bot: <jeanfred> Triggered a manual jsub update_monuments.sh ([[phab:T244213|T244213]]) * 08:45 wm-bot: <lokal-profil> Killed update_monuments job ([[phab:T244213|T244213]]) === 2019-11-02 === * 14:29 wm-bot: <jeanfred> Triggering a new full database update as the last one crashed and the DB is empty === 2019-10-12 === * 22:17 wm-bot: <jeanfred> Deleted old GridEngine jobs from Oct 7 (jstop update_monuments jstop categorize_images) === 2019-10-10 === * 19:02 wm-bot: <lokal-profil> restarted webservice === 2019-09-20 === * 19:49 wm-bot: <jeanfred> Added crontab (*/3 - every three minutes) for [[phab:T195341|T195341]] === 2019-09-18 === * 20:52 wm-bot: <jeanfred> check_emailable_users.py 
-category:"Images_from_Wiki_Loves_Monuments_2019" -delta:540 -notify ([[phab:T195341|T195341]]) === 2019-09-17 === * 21:04 JeanFred: Deploy latest from Git master: {{Gerrit|b068bd4}}, {{Gerrit|9f625b6}} ([[phab:T195341|T195341]]), {{Gerrit|799525d}} ([[phab:T195341|T195341]]) === 2019-09-09 === * 22:00 JeanFred: Deploy latest from Git master: {{Gerrit|109be22}} ([[phab:T231727|T231727]]) * 21:42 JeanFred: Deploy latest from Git master: {{Gerrit|0a4fbae}}, {{Gerrit|dff30f0}} ([[phab:T232394|T232394]], [[phab:T138668|T138668]]) === 2019-09-08 === * 22:20 JeanFred: Deploy latest from Git master: {{Gerrit|d36f393}} ([[phab:T138668|T138668]]) * 21:46 JeanFred: Deploy latest from Git master: {{Gerrit|5455f76}}, {{Gerrit|07eebf8}} ([[phab:T195341|T195341]]), {{Gerrit|3cb734b}} === 2019-09-03 === * 08:24 JeanFred: Deploy latest from Git master: {{Gerrit|39aec68}}, {{Gerrit|59e2ffb}} === 2019-09-02 === * 17:54 JeanFred: Deploy latest from Git master: {{Gerrit|33f4635}}, {{Gerrit|fd60416}} ([[phab:T195341|T195341]]) === 2019-09-01 === * 12:59 JeanFred: Deploy latest from Git master: {{Gerrit|52b2f4d}} === 2019-08-31 === * 21:25 JeanFred: Deploy latest from Git master: {{Gerrit|7bf1d3c}} ([[phab:T173783|T173783]]) === 2019-08-30 === * 16:46 JeanFred: Deploy latest from Git master: {{Gerrit|809e004}} === 2019-08-29 === * 20:49 JeanFred: Deploy latest from Git master: {{Gerrit|c518c02}} ([[phab:T172690|T172690]]) * 16:10 JeanFred: Deploy latest from Git master: {{Gerrit|d6381ec}}, {{Gerrit|5f8496e}}, {{Gerrit|0c6b6ad}}, {{Gerrit|c98a94d}} === 2019-08-26 === * 21:16 JeanFred: [[phab:T231223|T231223]]: webservice php5.6 stop ; webservice --backend=kubernetes php7.2 start * 21:09 JeanFred: Deploy latest from Git master: {{Gerrit|0aac903}}, {{Gerrit|1b3db24}}, {{Gerrit|cca0978}}, {{Gerrit|88a754a}}, {{Gerrit|d93b940}}, {{Gerrit|a571c03}}, {{Gerrit|1839191}}, {{Gerrit|0591ddc}}, {{Gerrit|6728961}}, {{Gerrit|f1e82c5}}, {{Gerrit|222780b}}, {{Gerrit|30813e8}}, {{Gerrit|e1a2422}}, 
{{Gerrit|94ddd2d}}, {{Gerrit|c6f7694}}, {{Gerrit|e019c8a}}, {{Gerrit|089ff87}}, {{Gerrit|3efdc7e}}, {{Gerrit|033ff27}}, {{Gerrit|f0fdecb}}, {{Gerrit|2e35e27}}, {{Gerrit|ba68f7b}}, {{Gerrit|975f530}}, {{Gerrit|58c8f34}}, {{Gerrit|8f30ee7}} ([[phab:T231152|T231152]]), {{Gerrit|a39179f}}, {{Gerrit|00d50e0}} ([[phab:T216364|T216364]]), {{Gerrit|0c0108e}}, {{Gerrit|afb11ea}}, {{Gerrit|ec66b4f}}, {{Gerrit|8d23f1e}}, === 2019-04-12 === * 10:04 JeanFred: Started a new harvest to better investigate [[phab:T180833|T180833]] === 2019-03-22 === * 21:40 Lokal_Profil: Manually starting harvesting job to ensure migration ([[phab:T216364|T216364]]) worked * 20:52 Lokal_Profil: Starting Stretch migration === 2018-12-01 === * 14:23 JeanFred: Service was returning 502 (even though `webservice status` was indicating "Your webservice of type php5.6 is running"), did a `webservice restart`. === 2018-10-04 === * 08:29 Lokal_Profil: Deploy latest from Git master: {{Gerrit|9fcc92a}}, {{Gerrit|0bd08cd}} === 2018-10-02 === * 21:37 Lokal_Profil: restart webservice * 21:02 Lokal_Profil: Deploy latest from Git master: {{Gerrit|33ab2cc}}, {{Gerrit|7e7533f}}, {{Gerrit|07bf55b}}, {{Gerrit|63530f4}}, {{Gerrit|ae38b67}}, {{Gerrit|5747927}} ([[phab:T203349|T203349]]), {{Gerrit|72822ee}} ([[phab:T204581|T204581]]), {{Gerrit|35e9b82}} ([[phab:T204580|T204580]]), {{Gerrit|0b2466a}} ([[phab:T176724|T176724]]) === 2018-09-20 === * 19:22 JeanFred: Deploy latest from Git master: {{Gerrit|a14a46d}} ([[phab:T203349|T203349]]) * 15:25 JeanFred: Deploy latest from Git master: {{Gerrit|0252aa2}}, {{Gerrit|c14485e}}, {{Gerrit|211a1d9}}, {{Gerrit|3fd9907}}, {{Gerrit|af28b24}}, {{Gerrit|eea49d4}}, {{Gerrit|2c7c42f}} ([[phab:T204351|T204351]]), {{Gerrit|e295352}}, {{Gerrit|093582a}}, {{Gerrit|0be7477}} === 2018-09-13 === * 22:28 JeanFred: Deploy latest from Git master: {{Gerrit|d850876}}, {{Gerrit|e6ea225}} ([[phab:T144724|T144724]]) * 15:34 JeanFred: Deploy latest from Git master: {{Gerrit|ea05949}}, 
{{Gerrit|4946df8}} ([[phab:T203251|T203251]]), {{Gerrit|f2e1368}}, {{Gerrit|3664168}} === 2018-09-10 === * 16:47 JeanFred: Deploy latest from Git master: {{Gerrit|6efe97f}}, {{Gerrit|f0c4ddd}} * 13:06 JeanFred: Deploy latest from Git master: {{Gerrit|08542e8}}, {{Gerrit|b9551e6}} ([[phab:T203250|T203250]]) === 2018-09-09 === * 10:01 JeanFred: Deploy latest from Git master: {{Gerrit|15d2b04}}, {{Gerrit|b8e7751}}, {{Gerrit|62d36e9}}, {{Gerrit|27e4a91}}, {{Gerrit|613a23d}}, {{Gerrit|4f8faef}}, {{Gerrit|edcb7ef}} ([[phab:T203428|T203428]]), {{Gerrit|0e3fa2d}} ([[phab:T203903|T203903]]) === 2018-09-05 === * 18:05 JeanFred: Deploy latest from Git master: {{Gerrit|8466ed3}} ([[phab:T202293|T202293]]), {{Gerrit|06586e9}} ([[phab:T203432|T203432]]), {{Gerrit|0631774}}, {{Gerrit|d2ff281}} ([[phab:T200159|T200159]]), {{Gerrit|d55090e}} === 2018-09-04 === * 06:36 JeanFrd: Edit crontab to run `update_monuments` with ` -mem 2000m` (see [[phab:T203417|T203417]]) * 06:35 JeanFrd: Submit `update_monuments` manually with ` -mem 2000m` (see [[phab:T203417|T203417]]) * 06:26 JeanFred: [yesterday] Deploy latest from Git master: {{Gerrit|68de059}}, {{Gerrit|bffe358}}, {{Gerrit|d8c9e7e}} * 06:26 JeanFred: [two days ago] Bump the pywikibot checked out on the server in ~/pywikibot (git fetch && git checkout 3.0.{{Gerrit|20180823}}) === 2018-09-02 === * 21:21 Lokal_Profil: Manual restart of update_monuments.sh === 2018-09-01 === * 21:48 Lokal-Profil: Deploy latest from Git master: {{Gerrit|b7a9c43}} ([[phab:T202378|T202378]]) * 20:23 Lokal-Profil: Deploy latest from Git master: {{Gerrit|17b27b3}}, {{Gerrit|c5122c7}} ([[phab:T200337|T200337]]), {{Gerrit|6e85721}}, {{Gerrit|b48135e}} ([[phab:T174871|T174871]]) === 2018-08-24 === * 07:36 Lokal_Profil: Deploy latest from Git master: {{Gerrit|69d0bac}}, {{Gerrit|0acc6a1}}, {{Gerrit|84108e1}} === 2018-08-20 === * 17:34 JeanFred: Deploy latest from Git master: {{Gerrit|3bfc7a0}}, {{Gerrit|6f78388}} ([[phab:T202253|T202253]]), {{Gerrit|239ac46}} 
([[phab:T202280|T202280]]), {{Gerrit|44efaa3}}, {{Gerrit|d40450d}} === 2018-08-13 === * 09:24 Lokal_Profil: Deploy latest from Git master: {{Gerrit|5ea3c21}}, {{Gerrit|0d6158d}} ([[phab:T200325|T200325]]) === 2018-07-31 === * 09:12 Lokal_Profil: Deploy latest from Git master: {{Gerrit|a230a6e}} ([[phab:T200337|T200337]]), {{Gerrit|12f5559}}, {{Gerrit|bad6d8d}} ([[phab:T200410|T200410]]), {{Gerrit|3c7e8d9}}, {{Gerrit|e00f1b3}}, {{Gerrit|a021cb9}} ([[phab:T200428|T200428]]), {{Gerrit|0631408}}, {{Gerrit|d2e8246}}, {{Gerrit|b9ff533}}, {{Gerrit|a471c38}}, {{Gerrit|1c5b9f6}} [correction to previous] * 09:08 Lokal_Profil: Manually resolving conflicts and creating a local commit due to [[phab:T200101|T200101]] * 09:07 Lokal_Profil: Deploy latest from Git master: {{Gerrit|6bffdbb}}, {{Gerrit|65f04d2}}, {{Gerrit|df4e313}} ([[phab:T200326|T200326]]), {{Gerrit|2b5e868}} ([[phab:T176845|T176845]]) === 2018-07-26 === * 13:18 JeanFred: Deploy latest from Git master: {{Gerrit|6bffdbb}}, {{Gerrit|65f04d2}}, {{Gerrit|df4e313}} ([[phab:T200326|T200326]]), {{Gerrit|2b5e868}} ([[phab:T176845|T176845]]) * 12:55 Lokal_Profil: Deploy latest from Git master: {{Gerrit|6bffdbb}}, {{Gerrit|65f04d2}}, {{Gerrit|df4e313}} ([[phab:T200326|T200326]]), {{Gerrit|2b5e868}} ([[phab:T176845|T176845]]) === 2018-07-22 === * 13:12 Lokal_Profil: Deploy latest from Git master: {{Gerrit|6dba253}}, {{Gerrit|f3552bc}} ([[phab:T176528|T176528]]), {{Gerrit|0eb8a6f}} ([[phab:T176528|T176528]]), {{Gerrit|9cad025}}, {{Gerrit|ac7ab36}} ([[phab:T176528|T176528]], [[phab:T176722|T176722]]) === 2018-07-20 === * 22:26 Lokal_Profil: rebuilt statistics using maintenance/_buildStats.php * 21:45 Lokal_Profil: dropped and recreated the `statistics` and the `monuments_nl-wd_(nl)` tables * 17:16 Lokal_Profil: Manually resolving conflicts and creating a local commit. 
Resolving the difference to master is tracked at https://phabricator.wikimedia.org/T200101 * 16:52 Lokal_Profil: Deploy latest from Git master: {{Gerrit|4031f51}} ([[phab:T176733|T176733]]), {{Gerrit|3b0661e}} ([[phab:T180692|T180692]]), {{Gerrit|8a60fd9}}, {{Gerrit|1b69a75}}, {{Gerrit|ccb8fb6}}, {{Gerrit|c54c56b}}, {{Gerrit|823ce1b}}, {{Gerrit|c3b7734}} ([[phab:T175907|T175907]]), {{Gerrit|9a14fef}}, {{Gerrit|962fe45}}, {{Gerrit|b5f3bc5}}, {{Gerrit|6662976}}, {{Gerrit|1b5bda8}}, {{Gerrit|121cbbd}}, {{Gerrit|cf30736}}, {{Gerrit|059ff06}}, {{Gerrit|3718a5e}}, {{Gerrit|e61bb05}}, {{Gerrit|ce2b43a}}, {{Gerrit|b73d519}}, {{Gerrit|a0e9888}}, {{Gerrit|6567c2a}}, {{Gerrit|929a49f}}, {{Gerrit|12680dd}}, {{Gerrit|dec0d98}} === 2018-02-25 === * 20:42 Lokal_Profil: Deploy latest from Git master: {{Gerrit|4031f51}} ([[phab:T176733|T176733]]), {{Gerrit|3b0661e}} ([[phab:T180692|T180692]]), {{Gerrit|8a60fd9}}, {{Gerrit|1b69a75}}, {{Gerrit|ccb8fb6}} ([[phab:T180850|T180850]]), {{Gerrit|c54c56b}} ([[phab:T179216|T179216]]), {{Gerrit|823ce1b}}, {{Gerrit|c3b7734}} ([[phab:T175907|T175907]]), {{Gerrit|9a14fef}}, {{Gerrit|962fe45}}, {{Gerrit|b5f3bc5}}, {{Gerrit|6662976}}, {{Gerrit|1b5bda8}}, {{Gerrit|121cbbd}}, {{Gerrit|cf30736}}, {{Gerrit|059ff06}}, {{Gerrit|3718a5e}}, {{Gerrit|e61bb05}}, {{Gerrit|ce2b43a}}, {{Gerrit|b73d519}}, {{Gerrit|a0e9888}}, {{Gerrit|6567c2a}}, {{Gerrit|929a49f}}, {{Gerrit|12680dd}} ([[phab:T2|T2]]), {{Gerrit|dec0d98}} === 2017-11-17 === * 21:31 JeanFred: Reverted to old database replicas (via `git reset HEAD~1 && git stash`) as part of [[phab:T180833|T180833]] investigation * 19:30 JeanFred: Started a new harvest to better investigate [[phab:T180833|T180833]] === 2017-11-16 === * 14:27 JeanFred: Deploy latest from Git master: {{Gerrit|b47ecb5}}, {{Gerrit|151e7dd}}, {{Gerrit|5c5014f}}, {{Gerrit|a80e017}}, {{Gerrit|f6313f5}} ([[phab:T180068|T180068]]), {{Gerrit|9742028}}, {{Gerrit|4031f51}} ([[phab:T176733|T176733]]) === 2017-10-26 === * 10:22 Lokal_Profil: 
Deploy latest from Git master: {{Gerrit|07ea3ce}} ([[phab:T176712|T176712]]), {{Gerrit|a4d73d8}}, {{Gerrit|d0ddf7a}}, {{Gerrit|4b9991d}} ([[phab:T176465|T176465]]), {{Gerrit|83259c9}}, {{Gerrit|14d6a1d}} === 2017-09-28 === * 23:28 JeanFred: Deploy latest from Git master: {{Gerrit|a95cf25}}, {{Gerrit|65e95f8}}, {{Gerrit|199f01b}} ([[phab:T176118|T176118]]), {{Gerrit|54a71d9}} ([[phab:T176991|T176991]]) === 2017-09-26 === * 23:21 JeanFred: Deploy latest from Git master: {{Gerrit|0dddaf5}} * 23:09 JeanFred: Deploy latest from Git master: {{Gerrit|8248ff4}} ([[phab:T176200|T176200]]) * 11:13 JeanFred: Deploy latest from Git master: {{Gerrit|2828a0f}} ([[phab:T117327|T117327]]) * 09:52 JeanFred: Deploy latest from Git master: {{Gerrit|263ccee}} ([[phab:T174556|T174556]]) === 2017-09-25 === * 09:49 JeanFred: Deploy latest from Git master: {{Gerrit|1ab75f9}} ([[phab:T174333|T174333]]) === 2017-09-24 === * 20:46 JeanFred: Deploy latest from Git master: {{Gerrit|96feb02}} ([[phab:T174426|T174426]]), {{Gerrit|dafe240}} * 20:45 JeanFred: Deploy latest from Git master: {{Gerrit|2f25778}} ([[phab:T174505|T174505]]) * 20:06 JeanFred: Deploy latest from Git master: {{Gerrit|2f25778}} ([[phab:T174505|T174505]]) * 19:10 JeanFred: Deploy latest from Git master: {{Gerrit|60a6097}} ([[phab:T174261|T174261]], [[phab:T174340|T174340]]), {{Gerrit|32c1d0d}} === 2017-09-23 === * 18:31 JeanFred: Deploy latest from Git master: {{Gerrit|f8a1b8a}} ([[phab:T153744|T153744]]) * 18:25 JeanFred: Deploy latest from Git master: {{Gerrit|1a7b88c}} ([[phab:T176528|T176528]]), {{Gerrit|150f545}} ([[phab:T176530|T176530]]) * 13:44 JeanFred: Deploy latest from Git master: {{Gerrit|30af42c}} ([[phab:T117330|T117330]]) === 2017-09-22 === * 17:10 JeanFred: Deploy latest from Git master: {{Gerrit|5510585}}, {{Gerrit|508a947}} === 2017-09-21 === * 22:57 JeanFred: Deploy latest from Git master: {{Gerrit|783aca6}}, {{Gerrit|6932790}} * 19:35 JeanFred: Deploy latest from Git master: {{Gerrit|730d577}} 
([[phab:T174871|T174871]]) * 19:23 JeanFred: Deploy latest from Git master: {{Gerrit|bf7488d}} ([[phab:T174614|T174614]]) === 2017-09-19 === * 17:11 JeanFred: Deploy latest from Git master: {{Gerrit|5e5c828}}, {{Gerrit|2486630}} ([[phab:T175906|T175906]]), {{Gerrit|3ac13b6}} ([[phab:T175899|T175899]]) === 2017-09-18 === * 13:58 JeanFred: Killed `update_monuments` job as it had been stuck for 3 days on the `Update monuments_all table` step. === 2017-09-17 === * 22:36 Lokal_Profil: Service down again ("backend is overloaded" in error.log). Tried another webservice restart * 18:28 JeanFred: Deploy latest from Git master: {{Gerrit|b3c3ab0}} ([[phab:T174871|T174871]]) * 18:28 JeanFred: Restarted the webservice − Yarl notified us around 10 hours ago that API was inaccessible. `webservice restart` fixed the issue. === 2017-09-14 === * 09:36 JeanFred: Deploy latest from Git master: {{Gerrit|d2aa019}}, {{Gerrit|0766491}}, {{Gerrit|d73eb9e}} ([[phab:T174871|T174871]]), {{Gerrit|5b00f0b}}, {{Gerrit|f8ff2a6}} ([[phab:T175839|T175839]]) * 08:37 JeanFred: Deploy latest from Git master: {{Gerrit|c5b8ffb}}, {{Gerrit|837707f}}, {{Gerrit|5799d26}} ([[phab:T174261|T174261]]), {{Gerrit|d330733}} ([[phab:T174340|T174340]]) === 2017-09-13 === * 16:53 JeanFred: Deploy latest from Git master: {{Gerrit|c5b8ffb}}, {{Gerrit|837707f}}, {{Gerrit|5799d26}} ([[phab:T174261|T174261]]), {{Gerrit|d330733}} ([[phab:T174340|T174340]]) === 2017-09-06 === * 20:13 Lokal_Profil: Deploy latest from Git master: {{Gerrit|7874ca3}} ([[phab:T173929|T173929]]), {{Gerrit|c53b5dc}}, {{Gerrit|2b37db6}} ([[phab:T174901|T174901]]), {{Gerrit|7206e0d}} === 2017-09-04 === * 14:05 Lokal_Profil: Deploy latest from Git master: {{Gerrit|9f59c59}} ([[phab:T174245|T174245]]), {{Gerrit|bc94498}}, {{Gerrit|d5b2006}} ([[phab:T174934|T174934]]) === 2017-09-01 === * 19:00 JeanFred: Deploy latest from Git master: {{Gerrit|2be9e28}} ([[phab:T166528|T166528]]), {{Gerrit|d3aa65a}}, {{Gerrit|61086ce}}, {{Gerrit|637a1c0}}, 
{{Gerrit|6bbcb0a}} ([[phab:T174146|T174146]]), {{Gerrit|566ab17}}, {{Gerrit|eae2643}}, {{Gerrit|9b8e2f4}} === 2017-08-25 === * 11:29 JeanFred: Deploy latest from Git master: {{Gerrit|598e5f9}} ([[phab:T174125|T174125]]) === 2017-08-22 === * 12:16 Lokal_Profil: Deploy latest from Git master: {{Gerrit|166f01d}}, {{Gerrit|1d33262}}, {{Gerrit|7177386}} ([[phab:T173717|T173717]]) === 2017-08-09 === * 14:28 Lokal_Profil: Deploy latest from Git master: {{Gerrit|4bb0c12}} ([[phab:T165759|T165759]], [[phab:T165759|T165759]]), {{Gerrit|35b20ec}}, {{Gerrit|c3be5fe}}, {{Gerrit|196c165}}, {{Gerrit|7b8dcb2}}, {{Gerrit|eac0756}}, {{Gerrit|e15a912}} ([[phab:T172094|T172094]]), {{Gerrit|6c195db}}, {{Gerrit|85b415c}} ([[phab:T112460|T112460]]) * 14:19 Lokal_Profil: Deploy latest from Git master: {{Gerrit|25023b6}}, {{Gerrit|d556d52}}, {{Gerrit|56cd469}}, {{Gerrit|e15709d}}, {{Gerrit|576a6d4}}, {{Gerrit|550fb2d}}, {{Gerrit|57d4f07}}, {{Gerrit|d2980f5}} === 2017-08-02 === * 13:32 Lokal_Profil: Update deployed pywikibot to 3.0.{{Gerrit|20170403}} ([[phab:T112460|T112460]]) === 2017-05-19 === * 17:23 JeanFred: Deploy latest from Git master: {{Gerrit|25023b6}}, {{Gerrit|d556d52}}, {{Gerrit|56cd469}}, {{Gerrit|e15709d}}, {{Gerrit|576a6d4}}, {{Gerrit|550fb2d}}, {{Gerrit|57d4f07}}, {{Gerrit|d2980f5}} * 17:22 JeanFred: Deploy latest from Git master: {{Gerrit|ae1e775}} ([[phab:T138517|T138517]]), {{Gerrit|2bd9781}}, {{Gerrit|04a19e0}} ([[phab:T158911|T158911]]), {{Gerrit|e8dbe35}}, {{Gerrit|4de8898}}, {{Gerrit|ea942d8}}, {{Gerrit|f956ab5}}, {{Gerrit|b63ebac}}, {{Gerrit|dd8c0c8}}, {{Gerrit|63b3bc8}}, {{Gerrit|8b15472}} === 2017-04-20 === * 15:24 JeanFred: Deploy latest from Git master: {{Gerrit|ae1e775}} ([[phab:T138517|T138517]]), {{Gerrit|2bd9781}}, {{Gerrit|04a19e0}} ([[phab:T158911|T158911]]), {{Gerrit|e8dbe35}}, {{Gerrit|4de8898}}, {{Gerrit|ea942d8}}, {{Gerrit|f956ab5}}, {{Gerrit|b63ebac}}, {{Gerrit|dd8c0c8}}, {{Gerrit|63b3bc8}}, {{Gerrit|8b15472}} === 2017-02-17 === * 12:53 JeanFred:
Deploy latest from Git master: {{Gerrit|68f3aaa}}, {{Gerrit|a0e053f}} === 2017-02-12 === * 10:03 JeanFred: Deploy latest from Git master: {{Gerrit|0810246}} ([[phab:T156139|T156139]]), {{Gerrit|a9fcbda}}, {{Gerrit|2bf70e1}}, {{Gerrit|435ad34}}, {{Gerrit|7951eab}}, {{Gerrit|4473c94}} === 2017-01-25 === * 18:58 JeanFred: Deploy latest from Git master: {{Gerrit|0810246}} ([[phab:T156139|T156139]]) * 13:30 JeanFred: Deploy latest from Git master: {{Gerrit|8c75342}} === 2017-01-09 === * 21:57 JeanFred: Deploy latest from Git master: {{Gerrit|3639bb0}} ([[phab:T153746|T153746]]), {{Gerrit|daee265}} ([[phab:T153842|T153842]]), {{Gerrit|282b912}}, {{Gerrit|6fd681c}}, {{Gerrit|e00ba31}} ([[phab:T154857|T154857]]) === 2016-12-21 === * 09:34 JeanFred: Deploy latest from Git master: {{Gerrit|26a5049}} ([[phab:T137882|T137882]]) === 2016-11-30 === * 16:12 JeanFred: Deploy latest from Git master: {{Gerrit|0ecce3e}}, {{Gerrit|095108c}}, {{Gerrit|d452948}}, {{Gerrit|f8922c9}}, {{Gerrit|a6d9634}}, {{Gerrit|4ab4148}}, {{Gerrit|d9afa6b}} === 2016-11-08 === * 16:17 JeanFred: Deploy latest from Git master: a70289c (T149258), c7eb06a, b5aeb29, 828e309 === 2016-10-26 === * 23:03 JeanFred: Deploy latest from Git master: c943e89, e3ba148, 063f9e2 (T148772 & T148773), 1db9f42, fe119e1, de2e590, 82147b8, f6ff350 & 63a1819 (T140795) === 2016-10-25 === * 16:44 JeanFred: Deploy latest from Git master: c943e89, e3ba148, 063f9e2 (T148772 & T148773), 1db9f42, fe119e1, de2e590, 82147b8, f6ff350 & 63a1819 (T140795) === 2016-10-19 === * 20:42 JeanFred: Deploy latest from Git master: 213b4a9 (T132644) * 20:15 JeanFred: Deploy latest from Git master: 9f67e94, c64c22e, a21517e === 2016-10-12 === * 09:15 JeanFred: Deployed master from Git: dce7434, 8bfba2a (T145333), faa356e, f87cb70 (T145574), 64ea088 (T143573), 8187df6 (T132641), 26171fe (T138633) === 2016-09-30 === * 19:22 JeanFred: Deployed latest from Git: 08ea28d, fb856b4, 867a229, a454375 === 2016-09-29 === * 10:48 JeanFred: Manually run 
populate_image_table.py to populate https://commons.wikimedia.org/wiki/Commons:Monuments_database/Indexed_images/Statistics === 2016-09-22 === * 16:29 Lokal_Profil: Deployed latest from Git, 2969082, cb1b318 (T114166) * 08:00 Lokal_Profil: Deployed latest from Git, f439ac4, fab9a23, ff37691, 241a27a, 0a45ad0, 1a526a5, 23ce89e (T144772) === 2016-09-21 === * 13:41 Lokal_Profil: Reverting local changes to categorize_images.py (T146278). The CI mechanisms are there for a reason === 2016-09-16 === * 13:04 JeanFred: Deactivate categorisation for ('it', 'it') as some Wikimedia Commons users are unhappy with it. * 09:40 Lokal_Profil: Added ka.wikipedia to pywikibot user_config === 2016-09-02 === * 16:24 JeanFred: Deployed latest from Git: 1603efb (T142570) === 2016-08-31 === * 20:27 JeanFred: Pulled latest pywikibot (branch 2.0) from Git: 8 commits, including fix for T144438. * 20:02 JeanFred: Deployed latest from Git: 22496d6 (T143481), c964df5, 00ccf8a, 1891ee0, 4458eeb, 55fea41 === 2016-08-09 === * 07:39 Lokal_Profil: Deployed latest from Git, 768b3ac, 30e33ca, 8d7de41 (T141505) === 2016-08-08 === * 09:06 Lokal_Profil: Updating pywikibot (37 commits) === 2016-08-01 === * 10:56 JeanFred: Deployed latest from Git: 6763dbb and dadf805 (T141757) * 07:54 Lokal_Profil: (correction to last line) Deployed latest from Git, 5fe42fe (T111618), 1ec3530, 9a630b5 (T139258) * 07:53 Lokal_Profil: Deployed latest from Git, 5fe42fe (5fe42fe), 1ec3530, 9a630b5 (T139258) === 2016-07-26 === * 13:42 JeanFred: Deployed latest from Git: 0c848e6 (T140488) * 08:30 Lokal_Profil: Add sq.wikipedia to pywikibot user_config === 2016-07-25 === * 13:30 JeanFred: Deployed latest from Git: ff61234, 112c14f, 7e98153, b614e78, aafc788 & 3222df9 (T140795), baad88c, ac388ee === 2016-07-13 === * 09:08 Lokal_Profil: Deployed latest from Git, 38363cb, 6117b13 (T139267), 96af0dc, 20d6da6 === 2016-07-07 === * 22:26 JeanFred: Deployed latest from Git: 1022e80, ad29828, 7ab4bcf, 48c96b4 (T139580), f29995d 
(T138633), 92f9234 * 22:22 JeanFred: Deployed latest from Git: 6e6cc59 === 2016-07-06 === * 14:28 JeanFred: Deployed latest from Git: 6e6cc59 * 11:23 yuvipanda: restarting tool with 'webservice stop' 'webservice --backend=kubernetes start' * 11:16 JeanFred: Deployed latest from Git: 3323de1 and 6d20267 (T138513) === 2016-07-04 === * 00:22 JeanFred: Deployed latest from Git: 2f9024e === 2016-07-03 === * 15:16 JeanFred: Deployed latest from Git: 76383f9 * 14:44 JeanFred: Deployed latest from Git: 1fdeff6 * 13:01 JeanFred: Deployed latest from Git: 0154a31, 9a4d05b, 4b6c343, 39c5409 (T138633) === 2016-06-30 === * 18:43 JeanFred: Deployed latest from Git: 795a396 (T136351), 4c8d9e3, e09e3d7 (T138519), 7840307, 4bdddd0 (T138606) === 2016-06-24 === * 22:48 Lokal_Profil: Recreated source tables (T138606) === 2016-06-23 === * 15:33 JeanFred: Added column wd_item to monuments_all, by copying monuments_all to tmp, alter table, and rename back to avoid locks. (T55808) * 14:33 JeanFred: Running 'ALTER TABLE `monuments_all` ADD COLUMN `wd_item` varchar(255) DEFAULT NULL;' ; taking a while... * 14:27 JeanFred: Stopped webservice, restarted and tying on trusty (`webservice --release=trusty start`) * 14:24 JeanFred: Monuments API currently down because of PHP 5.5 syntax, and host running 5.3 * 14:01 JeanFred: Deployed latest from Git: 4030533, bb95d23 (T55808), bd96bbd (T138377), 0a3247d, be9b1a9 (T134764) * 09:42 multichill: Added the fa Wikipedia account for Pywikibot. 
This should fix the broken unused image job === 2016-06-22 === * 16:56 JeanFred: Deployed latest from Git: 76c6dd6c (T138377) * 15:42 JeanFred: Deployed latest from Git: 4be4f04, c280649, d667e19 (T136566 & T137543), e867c45, 0a09a20 === 2016-06-10 === * 14:09 JeanFred: Deployed latest from Git: 74d9086 * 10:30 JeanFred: Deployed latest from Git: d25eda5 (T136704) === 2016-06-06 === * 15:46 JeanFred: Deployed latest from Git: b77b7c2 (T137096) === 2016-06-03 === * 14:00 JeanFred: Deployed latest from Git: 5ca0fdc, 688141a, 1af3b3d, 0262786, d812cdf (T134565) === 2016-05-23 === * 12:50 Lokal_Profil: Deployed latest from Git, 50915bf (T55688) === 2016-05-18 === * 10:50 JeanFred: Deployed latest from Git: 39780e2, 977c07f, 5f4532c, b7b297b (T135502 & T55688), 476267f (T39422) === 2016-05-12 === * 13:04 JeanFred: Deployed latest from Git: ebcd48c (T134727) === 2016-05-09 === * 11:15 JeanFred: Deployed latest from Git: 4f82b32 (T134728) === 2016-05-06 === * 19:47 Lokal_Profil: Deployed latest pywikibot-core/2.0 from Git * 19:26 Lokal_Profil: Deployed latest from Git, a724279 , d9ae73d (reverts 766d814 ) * 18:42 Lokal_Profil: Deployed latest from Git, 2d3ee40 (T39974), 766d814, e2fac07 and d2c242a (T39422) * 14:58 JeanFred: Deployed latest from Git: e5a9f01 and d509343 (T134567) * 12:15 JeanFred: Deployed latest from Git: db46042, c765e76, b5a731a, 7c27207, d4de720 (T134236), e7823ab & c83003b (T132647), 615ab28 === 2016-04-20 === * 13:01 JeanFred: Deployed latest from Git, 48bce77 and dfbff9b (T132029) === 2016-04-01 === * 22:51 multichill2: JeanFred did a git pull for [[Phab:T131344]] and others === 2016-03-31 === * 09:14 multichill: Commented out the Russian Wikipedia in user-config.py for [[Phab:T131344]] === 2016-03-16 === * 20:45 multichill: jsubbed populate_image_table.py for https://phabricator.wikimedia.org/T130107 (see crontab -l for exact command) === 2015-08-30 === * 14:38 multichill: Made local change to unused_images.py to get it to work, see 
https://phabricator.wikimedia.org/T110829 * 09:14 multichill: Updated ~/pywikibot to latest version, but still getting a FamilyMaintenanceWarning === 2015-08-22 === * 13:35 JeanFred: After backporting all local changes to Gerrit, updating local checkout to latest Git version. === 2015-07-15 === * 16:50 JeanFred: Checked out pywikibot-core === February 23 === * 20:30 multichill: Merged https://gerrit.wikimedia.org/r/192258 , but can't deploy it because api/includes/FormatHtml.php has local (I18n) changes. Anyone feel like fixing it? === December 21 === * 11:49 multichill: After the toolserver.org dns move the http://toolserver.org/~erfgoed/ redirects seem to be broken. Akoopal mentioned this, see https://lists.wikimedia.org/pipermail/labs-l/2014-December/003216.html === September 20 === * 15:56 multichill: Fixed https://bugzilla.wikimedia.org/show_bug.cgi?id=70806 and deployed 2 new sk tables === August 27 === * 18:05 multichill: Added Oren to the project === July 10 === * 13:25 multichill: dns was broken; because of that the api has been acting up for the last 2 (?) hours * 11:46 lokal-profil: Corrected commands at [[commons:Commons:Monuments_database/Harvesting]] * 11:20 multichill: Created ~/temp so that the change in https://gerrit.wikimedia.org/r/#/c/145254/1/api/includes/Defaults.php doesn't produce an error any more * 09:57 lokal-profil: Images and markers in Kml now load from // instead of <nowiki>http://</nowiki>, [https://gerrit.wikimedia.org/r/#/c/145256/ gerrit] * 09:42 lokal-profil: Added se-arbetsliv, a list for Working Life Museums in Sweden, [https://gerrit.wikimedia.org/r/#/c/145244/ gerrit] * 09:42 lokal-profil: Updated Defaults.php to point to toollabs instead of toolserver, [https://gerrit.wikimedia.org/r/#/c/145254/ gerrit] === June 29 === * 21:43 multichill: I put the RCE mysql conversion [[User:Akoopal]] made in ~/rce-nl-data . Still need to import it in Mysql to be useful. Data is CC0 * 19:30 multichill: Web service was down for all accounts. 
Back up and running. Api seems to have been down from 19:30 to 21:15 (Amsterdam time) * 13:36 multichill: Burned the old ~erfgoed account on the Toolserver and am uploading the backup to ~/toolserver_backup/ === June 19 === * 17:10 multichill: Fixed database_statistics.py after notification on https://commons.wikimedia.org/wiki/Commons_talk:Monuments_database/Statistics#Bug_in_the_URL . Still have to commit it === June 15 === * 11:17 multichill: Did some hacks with Krinkle to get i18n working(ish) again (api.php and html formatters). Still need to commit it === June 14 === * 19:28 multichill: Did the [https://www.wikidata.org/wiki/Wikidata_talk:Cultural_heritage_task_force#Rijksmonumenten_import first steps to import the data to Wikidata]. I wonder when we can deprecate the monument database * 19:26 multichill: I sent out the Toolserver will die email. http://lists.wikimedia.org/pipermail/labs-l/2014-June/002672.html . I plan to drop the database p_erfgoed_p on the 21st. * 11:33 multichill: Added Lokal Profil per request at [https://commons.wikimedia.org/w/index.php?title=User_talk:Multichill&oldid=126591871#Heritage_at_labs Commons] === June 7 === * 16:57 multichill: While updating documentation I found https://phabricator.wikimedia.org/diffusion/GWLA/ . Should probably be dropped, everything is in https://phabricator.wikimedia.org/diffusion/THER/history/ * 16:12 multichill: http://toolserver.org/~erfgoed/ now redirects to http://tools.wmflabs.org/heritage/ . Didn't move everything so that might give some 404s * 16:06 multichill: prox_search completed without problems. update_monuments.sh should now run without failures. * 15:52 multichill: symlinked ~/prox_search, fixed path (need to commit that), create_table_prox_search.sql , doing manual run * 15:35 multichill: Had to increase memory for statistics to 512M. Still need to commit that. 
jsubbed build_stats_test again and it finished with Memory usage: 396588928 * 15:04 multichill: Symlinked ~/public_html/maintenance and create tables statistics and statisticsct. jsubbed build_stats_test to test it * 14:49 multichill: Fixed populate_adm_tree.php and populated the table. Still need to commit it === June 4 === * 20:04 multichill: Managed to get the image database updated by switching latin1 -> utf8. Still have to commit. https://commons.wikimedia.org/wiki/Commons:Monuments_database/Indexed_images/Statistics * 19:58 multichill: Pointed https://commons.wikimedia.org/wiki/Template:Monuments_database_more_images to the api on labs. Was 15K hits on the Toolserver (?!) * 19:23 multichill: https://gerrit.wikimedia.org/r/137398 pretty images live, see http://tools.wmflabs.org/heritage/api/api.php?action=images&imcountry=ad&imid=100&format=html&props=img_name * 19:17 multichill: Fixed the mysqldump and enabled /data/project/heritage/erfgoedbot/populate_image_table.py === June 1 === * 20:18 multichill: Set up cron to run the update_monuments job every night. Some parts of it will still fail. * 20:05 multichill: Some tweaks in https://gerrit.wikimedia.org/r/136683 database is filled. Api is working (admintree and statistics still missing) * 17:14 multichill: Updated ~/bin/create_all_monuments_tables.sh and created 105 tables. Fired up update_database.py to fill the database * 17:01 multichill: Pull pywikibot (compat) and heritage. Symlinked it and setup the bot * 16:46 multichill: Moved erfgoedbot, public_html & pywikipedia to ~/old/. 
to make room * 16:41 multichill: Fixed ~/.database.inc , still have to do the i18n part * 16:35 multichill: Cleaned out some code in https://gerrit.wikimedia.org/r/136649 and merged it * 16:18 multichill: Created the s51138__heritage_p database * 16:16 multichill: Replaced the .my.cnf with the right credentials <noinclude>[[Category:SAL]]</noinclude> k39m7psbmwan28qny73gy4byyj6xizi Nova Resource:Quarry/SAL 498 16087 2309652 2305258 2025-06-09T07:46:11Z Stashbot 7414 taavi: delete redis pod stuck in Completed with no futher explanation why 2309652 wikitext text/x-wiki === 2025-06-09 === * 07:46 taavi: delete redis pod stuck in Completed with no futher explanation why === 2025-05-25 === * 07:56 taavi: reboot quarry-127b-3lqizumia4xn-node-1 [[phab:T395201|T395201]] === 2025-04-17 === * 11:03 wmbot~dcaro@acme: END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) * 10:55 taavi: deploying [[phab:T392138|T392138]] [[phab:T392141|T392141]] [[phab:T392143|T392143]] patches * 07:24 wmbot~dcaro@acme: START - Cookbook wmcs.openstack.cloudvirt.vm_console * 07:16 wmbot~dcaro@acme: END (PASS) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=0) * 07:14 wmbot~dcaro@acme: START - Cookbook wmcs.openstack.cloudvirt.vm_console * 07:14 wmbot~dcaro@acme: END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255) * 07:13 wmbot~dcaro@acme: START - Cookbook wmcs.openstack.cloudvirt.vm_console * 07:13 wmbot~dcaro@acme: END (ERROR) - Cookbook wmcs.openstack.cloudvirt.vm_console (exit_code=255) * 07:13 wmbot~dcaro@acme: START - Cookbook wmcs.openstack.cloudvirt.vm_console * 00:20 bd808: `kubectl delete pod -n quarry --all` ([[phab:T392107|T392107]]) * 00:12 bd808: `kubectl -n quarry delete pod/redis-676b955f95-tkbb7` ([[phab:T392107|T392107]]) === 2025-04-16 === * 18:26 taavi@cloudcumin1001: END (PASS) - Cookbook wmcs.vps.add_user_to_project (exit_code=0) for user 'taavi' in role 'reader' * 18:26 taavi@cloudcumin1001: START - Cookbook 
wmcs.vps.add_user_to_project for user 'taavi' in role 'reader' === 2024-07-17 === * 17:47 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0) * 17:45 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance * 17:44 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=0) * 17:43 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance * 17:42 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=99) * 17:42 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance * 17:34 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=99) * 17:34 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance * 17:33 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=99) * 17:33 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance * 17:28 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=99) * 17:28 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance * 17:25 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.rebuild_dbinstance (exit_code=99) * 17:25 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.rebuild_dbinstance === 2024-06-21 === * 02:43 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd * 02:36 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd * 02:28 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd * 02:22 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd * 02:21 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd * 02:21 andrew@cloudcumin1001: 
START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd * 02:16 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd * 02:10 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd * 02:07 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd * 02:02 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd * 01:51 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd * 01:47 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd * 01:46 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd * 01:42 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd * 01:33 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd * 01:26 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd * 01:25 andrew@cloudcumin1001: END (FAIL) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=99) for server tbd * 01:18 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd * 01:16 andrew@cloudcumin1001: END (PASS) - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs (exit_code=0) for server tbd * 01:12 andrew@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_dbinstance_to_ovs for server tbd === 2024-06-20 === * 12:11 taavi@cloudcumin1001: START - Cookbook wmcs.openstack.migrate_project_to_ovs === 2024-06-18 === * 19:33 andrewbogott: rebuilt magnum cluster using g4/ovs flavors === 2024-04-01 === * 12:48 Rook: quarry moved to k8s [[phab:T349032|T349032]] === 2024-03-11 === * 13:10 andrewbogott: deleting long-shutdown 
quarry-puppet-master-02 === 2024-02-02 === * 20:00 andrewbogott: rebuilding trove instances with new antelope guest image === 2023-12-04 === * 15:16 dcaro: re-enable puppet that had been disabled for too long ([[phab:T348748|T348748]]) === 2023-10-24 === * 12:56 Rook: minikube helm chart [[phab:T301469|T301469]] === 2023-05-29 === * 11:49 framawiki: deployed https://github.com/toolforge/quarry/pull/22 on 3 prod servers === 2023-05-27 === * 21:50 framawiki: shutdown potentially unused servers: quarry-nfs-1, quarry-puppet-master-02, quarry-dev-03 (this last one is started when there is a need to test patches) === 2023-05-19 === * 09:59 wm-bot2: added user isaacj to the project as reader ([[phab:T337019|T337019]]) - cookbook ran by arturo@endurance === 2023-04-18 === * 19:24 Rook: remove db entries ending in semicolon {{Gerrit|91d66e53de3b0ab754f89e84f85673212256adab}} === 2023-02-28 === * 12:17 Rook: Don't crash if multiple columns share a name {{Gerrit|a2c6d3fbf4c52b0967e4016ccdf3910db330cf0d}} [[phab:T170464|T170464]] * 11:48 Rook: More dropdown options for page length {{Gerrit|09fb982355c5e3856175e32c5b216f815fbd4f31}} [[phab:T126540|T126540]] === 2023-02-27 === * 14:41 Rook: enable search {{Gerrit|00064ea160bb856513c2cdcf64f629e654b96dc2}} [[phab:T90509|T90509]] === 2023-01-04 === * 16:41 Rook: Fix various outdated URLs in Quarry website footer === 2022-12-04 === * 16:23 dcaro: restarted uwsgi on quarry-web-02 as it was getting out of memory errors (and failing puppet) === 2022-09-20 === * 10:23 Rook: Update favicon while query running #7 [[phab:T316307|T316307]] {{Gerrit|ea7a4a29e4c98135ac2d22eada8c81478fda990c}} === 2022-09-09 === * 20:30 Rook: master branch moved to main * 20:29 Rook: repo moved to github https://github.com/toolforge/quarry === 2022-08-29 === * 11:42 Rook: 826948: update XlsxWriter plugin {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/826948 === 2022-08-10 === * 13:59 Rook: 821344: api: return consistently {{!}} 
https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/821344 * 12:41 Rook: 821283: Switch string and pipe {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/821283 === 2022-08-08 === * 16:43 Rook: Escape '{{!}}' from wikitable output {{Gerrit|c456f3ce007b9fae44e59677c4a7fcdf38564e67}} [[phab:T308362|T308362]] === 2022-07-06 === * 08:40 dcaro: rebooting worker-04 due to being unable to ssh to it (things started segfaulting, then too much work for irq) === 2022-06-23 === * 10:23 Rook: 807534: Introduce black formatting to quarry {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/807534 === 2022-06-22 === * 20:50 Rook: 806474: Get non-coincidental history entries. {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/806474 === 2022-06-21 === * 11:48 Rook: 806504: Show username on 404 page when logged in {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/806504 === 2022-06-16 === * 21:13 Rook: 806271: Prettify User not found page {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/806271 {{Gerrit|8002e6f06c64568441d7fc5ddc70ea2525a4c6fb}} === 2022-06-06 === * 15:49 Rook: 791669: Update stop status directly and catch error {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/791669 {{Gerrit|57a0e45dfe5d4736e79dff07e0054fb511aca718}} === 2022-05-17 === * 12:23 Rook: Deploying: 791606: Return 404 on query ids that do not exist {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/791606 {{Gerrit|e19b0a5e706f7e853f66b0d376c43cd499d8a0e2}} * 12:15 Rook: Deploying: 792277: query.py: Make quarry history descending {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/792277 {{Gerrit|538d322ffd18dd7ad1e53644cffc3946d1c42990}} === 2022-05-16 === * 17:32 Rook: 788438: Use vars.qrun_id when stopping query {{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/788438 {{Gerrit|43f7f0c9f56ac01f1a92b200769ddbe782055381}} * 15:48 Rook: 176506: Remember recent queries filter last used by a user. 
{{!}} https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/176506 {{Gerrit|3763507f1f7faa3bb44b196046fc5d153ce03924}} * 15:26 Rook: deploying link to database names https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/752034 {{Gerrit|6322dc735d465b9e6e65032a60aabf129d3073bc}} * 15:17 Rook: deploying entry for blank queries https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/791779 {{Gerrit|92de0d5737b4d7a611e4eb635794988f818ce4cd}} * 15:09 Rook: deploying truncation for history entries {{Gerrit|9ac970d861c5c12dbbafb065ce45ad1ba4fd5fec}} === 2022-04-18 === * 18:34 Rook: update phab links to prefilled ticket https://phabricator.wikimedia.org/T303028 * 17:30 Rook: exposing query history https://phabricator.wikimedia.org/T100982 === 2022-04-04 === * 13:00 taavi: delete quarry-db-01 === 2022-03-25 === * 12:04 dcaro: rebooting quarry-worker-04.quarry.eqiad1.wikimedia.cloud due to stuck nfs ([[phab:T304681|T304681]]) === 2022-03-21 === * 17:30 Rook: updating home page link to profile [[phab:T85175|T85175]] === 2022-02-20 === * 19:49 andrewbogott: moving nfs service from quarry-nfs-1 (bullseye) to quarry-nfs-2 (buster), testing to see if [[phab:T302154|T302154]] is a kernel or nfs-version issue * 19:23 taavi: hard rebooted quarry-nfs-1 again [[phab:T302154|T302154]] === 2022-02-19 === * 14:04 taavi: reboot quarry-nfs-1 [[phab:T302154|T302154]] === 2022-02-11 === * 21:11 andrewbogott: switching shared nfs project dir (again) to internal nfs server quarry-nfs-1 === 2022-02-10 === * 19:13 andrewbogott: rebooting all VMs to switch to new NFS server === 2022-01-29 === * 23:42 taavi: deploying https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/749756/ === 2021-09-27 === * 17:23 mdipietro: added stopped status [[phab:T289349|T289349]] === 2021-09-07 === * 13:10 mdipietro: tab will close autocomplete window [[phab:T289872|T289872]] === 2021-09-03 === * 17:39 andrewbogott: restarting celery workers and reloading web UI to pick up timeout changes * 16:45 bstorm: set 
live wait_timeout variable to 28800 (the default) on the trove instance [[phab:T290291|T290291]] === 2021-09-01 === * 21:56 andrewbogott: switched /srv/quarry to branch 'master' on quarry-worker-03, quarry-worker-04, quarry-web-02 * 18:26 bstorm: started instance quarry-dbbackup-01 [[phab:T289568|T289568]] * 13:07 mdipietro: Updated to Debian Buster/python 3.7 [[phab:T288528|T288528]] === 2021-08-16 === * 10:25 dcaro: Reverting deploy of [[phab:T287471|T287471]] - saved queries fail to show the DB field, will open a task * 10:21 dcaro: Deploying [[phab:T287471|T287471]] 2/2 - updating code on the web and worker servers * 10:19 dcaro: Deploying [[phab:T287471|T287471]] 1/2 - creating DB index === 2021-05-06 === * 17:57 bstorm: restarting web service to remove banner for wikireplicas upgrade * 17:49 bstorm: cleared out tmp files created by quarry web service that had filled the disk with find [[phab:T282171|T282171]] === 2021-04-23 === * 18:51 Framawiki: ran apt updates without issues on all 4 servers. [[phab:T266386|T266386]] looks fixed. 
=== 2021-04-07 === * 21:06 bstorm: deploying regex fixes [[phab:T278715|T278715]] === 2021-04-02 === * 09:05 Framawiki: shutdown quarry-dev server, normally unused now === 2021-03-26 === * 19:27 bstorm: deploying changes to the replica class and restarting things [[phab:T278544|T278544]] === 2021-03-25 === * 22:15 bstorm: removing the querykiller role [[phab:T264254|T264254]] * 22:03 bstorm: restarting celery worker processes to fix connection cleanup [[phab:T264254|T264254]] * 22:01 bstorm: restarting web interface for a small fix for the database field display [[phab:T264254|T264254]] === 2021-03-23 === * 21:45 bstorm: restarting quarry services for the meta_p and centralauth issue [[phab:T264254|T264254]] * 19:17 bstorm: finished updating quarry for multiinstance replicas [[phab:T264254|T264254]] * 18:51 bstorm: running the multiinstance migration script [[phab:T264254|T264254]] * 18:45 bstorm: stopping the quarry web service for the upgrade process * 18:43 bstorm: `git stash`ing the in-place test of sentinel in the code checkout * 15:08 dcaro: systemctl restart mariadb on quarry-db-01 brought it back to life ([[phab:T278230|T278230]]) * 14:53 andrewbogott: service mariadb start on quarry-db-01 === 2021-02-20 === * 13:18 wm-bot: framawiki: Deployed {{Gerrit|f51f9a9}} on -web-01 [[phab:T275277|T275277]] (not yet merged) * 11:09 wm-bot: framawiki: Deployed {{Gerrit|15a315a}} on -web-01 [[phab:T254847|T254847]] (`Update document title on title change`) === 2021-02-15 === * 19:24 Reedy: stash'd patch saved to quarry-web-01.quarry.eqiad1.wikimedia.cloud:/root/T274815.patch [[phab:T274815|T274815]] * 19:22 Reedy: [[phab:T274815|T274815]] filed with the login failure traceback * 19:21 Reedy: re-enabled puppet on quarry-web-01.quarry.eqiad1.wikimedia.cloud as it had been disabled for a week * 19:20 Reedy: `git stash` framawiki changes as it was breaking login === 2021-02-12 === * 17:12 bstorm: started quarry-dev-01 [[phab:T264254|T264254]] === 2020-12-15 === * 17:43 
Reedy: quarry-worker-02 `systemctl restart uwsgi-quarry-web.service` again, after pulling patch for [[phab:T270195|T270195]] * 17:43 Reedy: quarry-worker-01 `systemctl restart uwsgi-quarry-web.service` again, after pulling patch for [[phab:T270195|T270195]] * 17:40 Reedy: quarry-web-01 `systemctl restart uwsgi-quarry-web.service` again, after pulling patch for [[phab:T270195|T270195]] * 17:31 Reedy: quarry-web-01 `systemctl restart uwsgi-quarry-web.service` * 17:26 Reedy: `find /tmp -type f -mtime +30 -delete;` on quarry-web-01 [[phab:T270198|T270198]] * 17:23 Reedy: `apt-get clean && apt-get autoclean` on quarry-web-01 [[phab:T270198|T270198]] === 2020-10-20 === * 16:11 bstorm: restarted mariadb on quarry-db-01 so it pointed to the right data directory * 16:00 andrewbogott: rebooting quarry-web-01; lots of cruft in /tmp * 15:56 andrewbogott: restarting nginx on quarry-web-01 === 2020-09-03 === * 17:14 Framawiki: `framawiki@quarry-web-01:/tmp$ find /tmp/* -mtime +360 -user www-data -exec sudo rm -v <nowiki>{</nowiki><nowiki>}</nowiki> \;` 775 files deleted for 10G. Again. [[phab:T261909|T261909]] === 2020-07-08 === * 19:23 Framawiki: `framawiki@quarry-web-01:/tmp$ find /tmp/* -mtime +360 -user www-data -exec sudo rm -v <nowiki>{</nowiki><nowiki>}</nowiki> \;` 778 files deleted for 10G. 
=== 2020-03-06 === * 19:32 zhuyifei1999_: changed to analytics replica for database queries and restarted celery workers [[phab:T246970|T246970]] === 2020-02-26 === * 20:12 jeh: restart quarry-web-01 and quarry-worker-01 === 2020-01-14 === * 06:05 zhuyifei1999_: applied [[phab:T242355|T242355]]-ver2.patch [[phab:T242355|T242355]] === 2020-01-01 === * 06:47 zhuyifei1999_: Deployed {{Gerrit|3076232}} on quarry-web-01 [[phab:T147711|T147711]] * 06:06 zhuyifei1999_: Deployed {{Gerrit|d7ddab8}} on quarry-web-01 [[phab:T147711|T147711]] * 05:58 zhuyifei1999_: Deployed {{Gerrit|cd69638}} on quarry-web-01 [[phab:T147711|T147711]] === 2019-11-14 === * 21:59 zhuyifei1999_: `zhuyifei1999@quarry-web-01:/tmp$ find /tmp/* -mtime +360 -user www-data -exec sudo rm -v <nowiki>{</nowiki><nowiki>}</nowiki> \;` [[phab:T238375|T238375]] === 2019-10-19 === * 15:39 wm-bot: framawiki: Deployed {{Gerrit|90a1bef}} on -web-01 (`query-status.html: fix compiled.js`) * 15:18 wm-bot: framawiki: Deployed {{Gerrit|1f297c9}} on -web-01 [[phab:T205214|T205214]] (`query-status.html: hide Explain button until bug is solved`) === 2019-10-02 === * 17:20 mutante: - mariadb::packages should now work on buster too, fyi === 2019-10-01 === * 11:51 zhuyifei1999_: restart celery-quarry-worker.service === 2019-06-28 === * 18:45 zhuyifei1999_: Deployed {{Gerrit|2f7ee60}} to quarry-web-01 * 14:37 bstorm_: changed to web replica for database queries and restarted celery workers === 2019-06-21 === * 21:34 wm-bot: framawiki: Deployed {{Gerrit|5d6844e}} on -web-01 === 2019-06-14 === * 20:23 wm-bot: framawiki: Deployed {{Gerrit|b303ce8}} on -web-01 === 2019-05-25 === * 12:58 framawiki: block spammer `INSERT INTO user_group (user_id, group_name) VALUES (3765, "blocked");` * 12:22 wm-bot: framawiki: Deployed {{Gerrit|cc0c0a7}} on -web-01 [[phab:T224300|T224300]] === 2019-05-24 === * 21:00 zhuyifei1999_: masked uwsgi service on quarry-web-01 to prevent future mess-ups * 20:59 zhuyifei1999_: reenabled puppet on 
quarry-web-01, should use uwsgi-quarry-web service not uwsgi service * 20:51 zhuyifei1999_: disabled puppet on quarry-web-01 because it wants uwsgi dead === 2019-05-12 === * 18:01 wm-bot: framawiki: Deployed {{Gerrit|3e25078}} on -web-01 [[phab:T223013|T223013]] * 17:57 wm-bot: framawiki: Deployed {{Gerrit|3776b5f}} on -web-01 [[phab:T223018|T223018]] * 12:51 wm-bot: framawiki: Deployed {{Gerrit|7a13ab4c68e16201f11cce664ddeaf64805f3c2b}} on -web-01 === 2019-05-11 === * 22:10 framawiki: re-enable puppet [[phab:T223018|T223018]] * 22:00 framawiki: disabling puppet temporary on -web-01 to test nginx conf [[phab:T223018|T223018]] * 19:25 framawiki: block spammer `INSERT INTO user_group (user_id, group_name) VALUES (3927, "blocked"), (3958, "blocked"), (3984, "blocked"), (3985, "blocked"), (3986, "blocked");` === 2019-05-10 === * 14:00 andrewbogott: restarting uwsgi-quarry-web and nginx on quarry-web-01 * 07:09 zhuyifei1999_: restarted uwsgi. nginx reports 502 === 2019-04-05 === * 18:48 zhuyifei1999_: checked out FETCH_HEAD on quarry-web-01 [[phab:T209226|T209226]] * 18:43 zhuyifei1999_: applied 0001-SECURITY-escape-CSV-injections.patch on quarry-web-01 and restarted uwsgi [[phab:T209226|T209226]] === 2019-03-16 === * 08:31 framawiki: restarted uwsgi to deal with 502 nginx errors `sudo systemctl restart uwsgi-quarry-web` === 2019-03-02 === * 18:55 framawiki: block spammer https://quarry.wmflabs.org/Twc93521 `INSERT INTO user_group (user_id, group_name) VALUES (3734, "blocked");` === 2019-02-21 === * 09:29 gtirloni: applied CSP change [[phab:T214637|T214637]] * 09:22 gtirloni: updated and rebooted all servers (debian 9.8) === 2019-02-20 === * 20:59 wm-bot: framawiki: Deployed {{Gerrit|8f72587}} on -web-01 [[phab:T216581|T216581]] * 20:38 framawiki: re-activating puppet on -web-01, csp conf looks good [[phab:T214637|T214637]] * 20:15 framawiki: disabling puppet temporary on -web-01 to test csp conf [[phab:T214637|T214637]] === 2019-02-18 === * 21:48 framawiki: Deployed 
{{Gerrit|6bda39e}} on -web-01 [[phab:T215831|T215831]] === 2018-12-24 === * 21:40 zhuyifei1999_: Deployed {{Gerrit|2a51a54}} on -web-01 [[phab:T212598|T212598]] === 2018-12-02 === * 21:37 zhuyifei1999_: deployed till {{Gerrit|f9ad985}} * 09:35 framawiki: deployed {{Gerrit|575fc1c}} [[phab:T209783|T209783]] and {{Gerrit|06a1f9f}} [[phab:T205151|T205151]] on quarry-web-01 === 2018-11-27 === * 18:54 zhuyifei1999_: triggered OOM killer on quarry-worker-02 9 times [[phab:T188564|T188564]] === 2018-11-16 === * 19:09 framawiki: deployed {{Gerrit|4f0b830}} to quarry-web-01 [[phab:T71264|T71264]] === 2018-11-12 === * 20:37 framawiki: deployed till {{Gerrit|ed511d1}} on quarry-web-01 [[phab:T205222|T205222]] [[phab:T205221|T205221]] === 2018-11-05 === * 18:49 zhuyifei1999_: `UPDATE query join query_revision on query.latest_rev_id = query_revision.id join query_run on latest_run_id = query_run.id SET status=1 where (status = 2 or status = 0) and query_run.timestamp <= DATE_ADD(NOW(), INTERVAL -30 MINUTE);` 286 rows affected. 
* 18:43 framawiki: migration is over [[phab:T207677|T207677]] * 18:33 zhuyifei1999_: flushed redis with flushall * 18:22 zhuyifei1999_: unset db read-only `SET GLOBAL read_only = 0; UNLOCK TABLES;` [[phab:T207677|T207677]] * 18:01 andrewbogott: moving instances from eqiad to eqiad1-r * 17:40 zhuyifei1999_: set db read-only `FLUSH TABLES WITH READ LOCK; SET GLOBAL read_only = 1;` [[phab:T207677|T207677]] * 17:24 zhuyifei1999_: shutting down all workers `sudo kill -TERM $(systemctl show -p MainPID celery-quarry-worker.service {{!}} cut -d= -f2)` [[phab:T207677|T207677]] === 2018-11-04 === * 22:24 zhuyifei1999_: checked out FETCH_HEAD {{Gerrit|8c065d0}}, previous head was {{Gerrit|71643b6}} on quarry-web-01 * 17:37 framawiki: deployed {{Gerrit|c10fc32}} and {{Gerrit|71643b6}} on quarry-web-01 === 2018-10-21 === * 16:19 zhuyifei1999_: deployed {{Gerrit|99db770}} to workers [[phab:T126888|T126888]] === 2018-10-15 === * 18:52 framawiki: deployed {{Gerrit|c1dfde7}} on quarry-web-01, quarry-worker-0{1,2} [[phab:T126888|T126888]] === 2018-10-06 === * 16:30 framawiki: deployed {{Gerrit|8550956}} on quarry-web-01 === 2018-09-26 === * 13:24 zhuyifei1999_: restarted mariadb on -db-01 after max_allowed_packet fix === 2018-09-25 === * 12:09 arturo: make myself projectadmin === 2018-09-24 === * 22:33 framawiki: manually clear queries and resultsets where userid=3214 [[phab:T205286|T205286]] * 22:31 framawiki: `update query_run set status=3 where id=290865;` on quarry-db-01 to mark a ghost query as killed on the ui * 21:50 framawiki: `select * from query_run where id=290865;` on quarry-db-01 to mark a ghost query as killed on the ui * 17:54 framawiki: quarry-db-01: `INSERT INTO user_group (user_id, group_name) VALUES (3214, "blocked");` [[phab:T205286|T205286]] * 17:53 framawiki: deployed {{Gerrit|028a292}} on quarry-web-01 [[phab:T205286|T205286]] [[phab:T104322|T104322]] === 2018-09-23 === * 16:56 zhuyifei1999_: deployed till {{Gerrit|e74f575}} on -web-01, 
[[phab:T202588|T202588]] [[phab:T205153|T205153]] === 2018-09-21 === * 19:29 framawiki: deployed {{Gerrit|4b01077}} on quarry-web-01 [[phab:T204964|T204964]] === 2018-09-19 === * 16:57 framawiki: deployed {{Gerrit|4994570}} to quarry-web-01 [[phab:T204805|T204805]] === 2018-09-17 === * 20:06 framawiki: deployed till {{Gerrit|e59152e}} [[phab:T192696|T192696]] [[phab:T204432|T204432]] [[phab:T204226|T204226]] [[phab:T204430|T204430]] to quarry-web-01 * 17:41 framawiki: deployed {{Gerrit|e8e6e02}} to quarry-web-01 [[phab:T73064|T73064]] === 2018-09-16 === * 16:28 zhuyifei1999_: Deployed till {{Gerrit|2081a97}} on -web-01 [[phab:T204435|T204435]] === 2018-09-13 === * 22:10 zhuyifei1999_: purging stuffs created by labs_debrepo [[phab:T153615|T153615]] * 21:40 zhuyifei1999_: deployed {{Gerrit|8b4bde0}} to quarry-web-01 [[phab:T204277|T204277]] * 19:19 framawiki: deleted legacy instances quarry-main-01 and quarry-runner-0{1,2}, migration is over [[phab:T202588|T202588]] * 19:10 framawiki: copy /var/log/nginx from legacy main-01 to /data/project/nginx-logs-legacy-20180913-framawiki for further analysis [[phab:T202588|T202588]] [[phab:T197256|T197256]] === 2018-09-12 === * 21:21 zhuyifei1999_: unset read-only again on new database * 21:15 zhuyifei1999_: `sudo chown quarry:quarry /data/project/quarry/ -Rv` [[phab:T202588|T202588]] * 21:13 zhuyifei1999_: set read-only again on new database because new quarry's UID is 498 [[phab:T202588|T202588]] * 21:10 zhuyifei1999_: unset read-only again on new database * 21:09 zhuyifei1999_: rm'ed /var/lib/mysql on -db-01, we are using /srv/sqldata/ now, and if something goes really better have a loud failure * 21:03 zhuyifei1999_: deployed {{Gerrit|8b4bde0}} to -web-01 [[phab:T204161|T204161]] * 21:01 zhuyifei1999_: deployed {{Gerrit|461e56c}} * 20:45 zhuyifei1999_: set read-only again on new database due to a bug in worker code * 20:42 zhuyifei1999_: unset read-only on new database [[phab:T202588|T202588]] * 20:41 framawiki: switched 
quarry.wmflabs.org proxy to new quarry-web-01.quarry.eqiad.wmflabs [[phab:T202588|T202588]] * 20:27 zhuyifei1999_: backed up old db to /data/project/dump-2018-09-12.sql and restoring to new server [[phab:T202588|T202588]] * 20:03 zhuyifei1999_: set quarry-main-01 mariadb read-only [[phab:T202588|T202588]] * 20:02 zhuyifei1999_: stopped celery-quarry-worker on quarry-runner-0{1,2} [[phab:T202588|T202588]] * 19:45 zhuyifei1999_: created new quarry database and user in quarry-db-01.quarry.eqiad.wmflabs [[phab:T202588|T202588]] === 2018-09-11 === * 17:22 zhuyifei1999_: doing another backup of main db: `sudo mysqldump quarry {{!}} sudo tee /data/project/dump-$(date '+%Y-%m-%d').sql > /dev/null` [[phab:T202588|T202588]] * 17:14 zhuyifei1999_: disabling puppet on quarry-main-01, quarry-runner-0{1,2} [[phab:T202588|T202588]] === 2018-09-07 === * 21:06 zhuyifei1999_: reverted hotpatch, deployed till {{Gerrit|3375dc3}} * 20:47 zhuyifei1999_: hotpatch /etc/uwsgi/apps-enabled/quarry-web.ini processes 8 -> 1 for some gdb-ing * 19:56 framawiki: deployed {{Gerrit|501695f}} to quarry-main-01 ([[phab:T202588|T202588]]) * 18:11 framawiki: deployed {{Gerrit|769cace}} to quarry-main-01 ([[phab:T202588|T202588]]) === 2018-08-24 === * 21:21 framawiki: deployed {{Gerrit|4814d58}} ([[phab:T124625|T124625]]) to quarry-main-01 === 2018-05-31 === * 15:52 zhuyifei1999_: live-patch `/srv/quarry/quarry/web/connections.py` on `quarry-main-01` and restart uwsgi === 2018-05-07 === * 22:40 framawiki: deployed {{Gerrit|24038e3}} to quarry-main-01 === 2018-05-04 === * 02:05 zhuyifei1999_: Deployed {{Gerrit|6069904}} === 2018-05-02 === * 18:23 framawiki: deployed {{Gerrit|af2f7e6}} to quarry-main-01 * 18:04 zhuyifei1999_: Deployed {{Gerrit|f4e86f1}} and restarted everything === 2018-04-24 === * 21:36 framawiki: removing old /srv/venv on quarry-main-01 [[phab:T192731|T192731]] === 2018-04-23 === * 22:50 zhuyifei1999_: Does quarry only have an effective concurrency limit of 3, despite having a few dozen 
celery worker processes?! * 22:46 zhuyifei1999_: behaving abnormally. https://quarry.wmflabs.org/query/26629 has been queued for 16 mins... (hopefully) have some time to investigate === 2018-04-18 === * 23:21 framawiki: deployed {{Gerrit|02049d9}} to quarry-main-01 * 22:21 zhuyifei1999_: +Framawiki project admin & Gerrit +2 * 19:35 zhuyifei1999_: deployed {{Gerrit|c6cd55e}} to quarry-main-01 * 17:36 zhuyifei1999_: deployed {{Gerrit|8eeeff8}} to quarry-main-01 === 2018-04-17 === * 23:00 zhuyifei1999_: forgot to restart uwsgi on last deployment. restarted it now * 00:34 zhuyifei1999_: Deploy {{Gerrit|b5fd6b0}} on quarry-main-01 === 2018-03-24 === * 03:21 zhuyifei1999_: revert back to {{Gerrit|d9cc1c8}} again on quarry-runner0{1,2} [[phab:T188564|T188564]] [[phab:T190608|T190608]] === 2018-03-16 === * 00:29 zhuyifei1999_: deploying {{Gerrit|fc109c2}} to both runners [[phab:T188564|T188564]] === 2018-03-15 === * 19:27 zhuyifei1999_: switch back to {{Gerrit|d9cc1c8}} on both hosts * 16:29 zhuyifei1999_: quarry-runner-02 is on {{Gerrit|d9cc1c8}} * 16:21 zhuyifei1999_: installed python-dbg on quarry-runner-02 because it's so good * 16:18 zhuyifei1999_: depool quarry-runner-01 * 15:56 zhuyifei1999_: deploying {{Gerrit|d653400}} to quarry-runner-0{1,2} [[phab:T188564|T188564]] === 2018-03-01 === * 18:41 zhuyifei1999_: deploying {{Gerrit|d5e2845}} to quarry-runner-01 & 02 * 00:37 zhuyifei1999_: `UPDATE query join query_revision on query.latest_rev_id = query_revision.id join query_run on latest_run_id = query_run.id SET status=1 where (status = 2 or status = 1) and query_run.timestamp <= DATE_ADD(NOW(), INTERVAL -1 HOUR);` 251 rows affected (1.81 sec) [[phab:T139162|T139162]] [[phab:T172086|T172086]] [[phab:T188564|T188564]] === 2018-02-28 === * 22:57 zhuyifei1999_: killed two IO-intensive query saves === 2018-02-09 === * 01:06 bd808: Removed TestingAccount2 at user request ([[phab:T186289|T186289]]) * 01:06 bd808: Removed Yuvipanda at user request ([[phab:T186289|T186289]]) 
=== 2018-01-02 === * 10:52 zhuyifei1999_: deploying {{Gerrit|d9cc1c8}} to quarry-runner-01 & 02 [[phab:T172143|T172143]] === 2017-12-13 === * 19:19 zhuyifei1999_: Deployed {{Gerrit|62676f2}} to quarry-main-01 and restarted uwsgi === 2017-12-10 === * 05:52 zhuyifei1999_: deployed {{Gerrit|e835a46}} to quarry-main-01 and restarted uwsgi [[phab:T165169|T165169]] * 05:49 zhuyifei1999_: quarry-main-01: `ALTER IGNORE TABLE star ADD UNIQUE INDEX star_user_query_index (user_id, query_id);` Records: 728 Duplicates: 17 Warnings: 0 [[phab:T165169|T165169]] === 2017-12-05 === * 18:56 zhuyifei1999_: quarry-main-01: `MariaDB [quarry]> UPDATE user SET username = '-revi' WHERE username = 'Hym411';` [[phab:T182064|T182064]] === 2017-10-02 === * 05:15 zhuyifei1999_: Deployed {{Gerrit|644b293}} to quarry-main-01 and restarted uwsgi === 2017-09-26 === * 04:08 zhuyifei1999_: Restarting service 'uwsgi-quarry-web' on quarry-main-01, 'celery-quarry-worker' on quarry-runner-01 & quarry-runner-02 [[phab:T176694|T176694]] * 03:59 zhuyifei1999_: Switching REPLICA_HOST from 'enwiki.labsdb' to 'enwiki.analytics.db.svc.eqiad.wmflabs' [[phab:T176694|T176694]] (Executing `sudo -- sudo -u quarry sed -i 's/enwiki.labsdb/enwiki.analytics.db.svc.eqiad.wmflabs/' /srv/quarry/quarry/config.yaml` on all hosts) === 2017-09-10 === * 16:22 zhuyifei1999_: Deployed {{Gerrit|a6173a2}} on quarry-main-01 [[phab:T175466|T175466]] === 2017-09-08 === * 18:06 zhuyifei1999_: Deployed {{Gerrit|d9e8a4a}} to quarry-main-01 [[phab:T175285|T175285]] === 2017-09-06 === * 00:05 zhuyifei1999_: backup quarry main database to /data/project/dump-2017-09-05.sql because I fear it die :(. 
Executing `sudo mysqldump quarry {{!}} sudo tee /data/project/dump-2017-09-05.sql > /dev/null` === 2017-08-11 === * 11:12 zhuyifei1999_: deployed {{Gerrit|2834160}} on quarry-main-01 === 2017-08-01 === * 12:51 zhuyifei1999_: Deployed {{Gerrit|ba54a61}} on quarry-main-01 [[phab:T164390|T164390]] === 2017-07-31 === * 16:31 zhuyifei1999_: Repeated for quarry-main-01, but restarted uwsgi [[phab:T146483|T146483]] * 16:30 zhuyifei1999_: Repeated for quarry-runner-02 * 16:28 zhuyifei1999_: Restarted celery-quarry-worker on quarry-runner-01 [[phab:T146483|T146483]] * 16:22 zhuyifei1999_: `zhuyifei1999@quarry-runner-01:/srv/quarry$ sudo git fetch; sudo git checkout 6447943` [[phab:T146483|T146483]] * 01:01 zhuyifei1999_: `zhuyifei1999@quarry-main-01:/srv/quarry$ sudo git fetch; sudo git checkout 7dd8c60; sudo service uwsgi restart` [[phab:T101424|T101424]] === 2017-07-30 === * 21:17 zhuyifei1999_: `sudo service uwsgi restart` [[phab:T76126|T76126]] * 21:14 zhuyifei1999_: `sudo git fetch; sudo git checkout 172eb7e` on /srv/quarry [[phab:T76126|T76126]] * 20:10 zhuyifei1999_: Yuvi gave me access after I asked about [[phab:T76126|T76126]] === 2017-06-30 === * 22:34 bd808: Added BryanDavis (self) as project admin * 18:09 bd808: Ran service uwsgi-quarry-web restart on quarry-main-01. 
People seeing intermittent 502s === 2017-06-26 === * 18:33 milimetric: Restarted celery workers on quarry-runner-01 and quarry-runner-02 (systemctl restart celery-quarry-worker.service) * 17:20 madhuvishy: Add milimetric as project admin === 2017-01-19 === * 12:07 yuvipanda: run chown -R 998:998 quarry/ on labstore1004 === 2016-10-05 === * 19:48 mutante: quarry-runner-01 has a problem starting exim4 * 19:47 mutante: merged gerrit 308313 - should definitely be no-op, but noticed that puppet is disabled on quarry-main-01 === 2016-10-04 === * 19:33 valhallasw`cloud: removed myself as admin === 2016-09-23 === * 20:41 yuvipanda: add halfak as projectadmin === 2016-05-08 === * 21:35 Krenair: restarted quarry-runner-01 to attempt to get the queue working again - new queries are going through but some old ones (from the last few hours) seem stuck === 2016-04-02 === * 14:06 valhallasw`cloud: systemctl restart celery-quarry-worker.service hangs. Will now reboot quarry-runner-02. * 14:02 valhallasw`cloud: killed 100% CPU using process on quarry-runner-02 (ptrace suggested some sort of idle loop). Let's see if that has any effect. === July 4 === * 15:45 YuviPanda: deploying to latest master and hoping! 
=== April 30 === * 21:57 andrewbogott: moving quarry-main-01 to labvirt1003 * 20:53 andrewbogott: moving quarry-runner-01 to labvirt1004 * 19:41 andrewbogott: cold-migrating quarry-runner-test to labvirt1003 === December 8 === * 19:36 YuviPanda: deploying to get in valhallasw`cloud’s patches === September 30 === * 15:44 andrewbogott: enabled puppet on quarry-runner-test, updated, installed a bunch of maria stuff, rebooted === September 16 === * 21:53 YuviPanda: applying db, web and redis roles to quarry-runner-test, will act as db and web host until labs issues clear up === August 21 === * 19:47 YuviPanda: upgraded all text and varchar columns to utf8 === August 17 === * 02:03 YuviPanda: increased mysql connection limit manually to 1024, re-running all old query-runs to produce output in new sqlite format === August 14 === * 20:21 YuviPanda: upgrade to MariaDB 10.1, because fuck offline ALTER TABLEs {{SAL|Project Name=quarry}} <noinclude>[[Category:SAL]]</noinclude> hw5b3guj973onsefc1ab1rjj9495uie Release Engineering/SAL 0 17290 2309631 2309516 2025-06-08T18:04:09Z Stashbot 7414 James_F: Zuul: [mediawiki/extensions/SemanticVersion] Add basic CI 2309631 wikitext text/x-wiki == 2025-06-08 == * 18:04 James_F: Zuul: [mediawiki/extensions/SemanticVersion] Add basic CI == 2025-06-06 == * 14:37 jnuche: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/79 == 2025-06-05 == * 23:21 thcipriani: update scap in beta to 4.171.0 to match prod * 20:44 James_F: Zuul: [wikimedia-ui-base] Sunset WikimediaUI Base, archive repo's CI, for [[phab:T354310|T354310]] * 20:20 bd808: Added `profile::memcached::firewall_src_sets: ~` to deployment-memc prefix puppet ([[phab:T396109|T396109]]) * 19:03 Krinkle: Update profile::tlsproxy::envoy::cfssl_options under deployment-mediawiki in Horizon, to include the remaining wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote,
wikisource, wikiversity, wiktionary. ref [[phab:T289318|T289318]] * 18:26 James_F: Docker: Re-build PHP images with php-uuid (and incidentally bump versions), for [[phab:T373752|T373752]] * 17:14 James_F: Docker: [mediawiki-phan-testrun] Migrate parent image from php74 to php81 * 17:10 James_F: Docker: [phpmetrics] Migrate parent image from php74 to php81 * 17:10 James_F: Where will Abstract Content go? * 17:07 James_F: Zuul: [mediawiki/extensions/WikimediaMaintenance] Add dependencies, for [[phab:T58074|T58074]] * 16:39 James_F: Zuul: [mediawiki/tools/phan/PerfCheckPlugin] Use a template for CI * 16:37 James_F: Zuul: [mediawiki/tools/phan/SecurityCheckPlugin] Stop testing in PHP 7.4 * 16:36 James_F: Zuul: [labs/tools/heritage] Raise PHP testing from 7.4 to 8.1 * 16:34 James_F: Zuul: Stop testing most libraries and tools in PHP 7.4 * 16:28 James_F: Zuul: Stop testing PHP extensions with PHP 7.4 * 16:26 James_F: Zuul: [integration/quibble] Stop testing in PHP 7.4, for [[phab:T328921|T328921]] and [[phab:T328922|T328922]] * 16:23 James_F: Zuul: [mediawiki/services/parsoid] Stop testing in PHP 7.4 * 16:21 James_F: Zuul: [operations/mediawiki-config] Stop testing in PHP 7.4 * 16:09 James_F: Zuul: Drop all PHP 7.4 testing for MediaWiki things, for [[phab:T328921|T328921]] and [[phab:T328922|T328922]] * 04:46 Krinkle: gitpuppet@deployment-puppetserver-1:/srv/git/operations/puppet$ Cherry-pick https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153764, ref [[phab:T289318|T289318]] * 03:58 Krinkle: Update profile::cache::haproxy::available_unified_certificates under deployment-cache in Horizon, to include the remaining wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary. Remove `*.zero.wikipedia.beta.wmflabs.org` which wasn't responding/didn't work anymore.
ref [[phab:T289318|T289318]] * 03:34 Krinkle: Update profile::acme_chief::certificates under deployment-acme-chief prefix in Horizon, to include the remaining wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary (wikipedia and wikivoyage were already there), ref [[phab:T289318|T289318]] * 03:34 Krinkle: Update profile::acme_chief::certificates under deployment-acme-chief prefix in Horizon, to include the remaining wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary (wikipedia and wikivoyage were already there) * 00:32 Krinkle: Add `TXT *.wikimedia.beta.wmcloud.org. "v=spf1 -all"` to match beta.wmflabs.org DNS (ref [[phab:T289318|T289318]], changing email is out of scope for now, but might as well add the DNS records). * 00:22 Krinkle: Adding missing DNS entries under beta.wmcloud.org. There was already: *.wikipedia, *.m.wikimedia, *.wikivoyage, *.m.wikivoyage (for [[phab:T355281|T355281]]). Adding: wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary, wikidata, upload ([[phab:T289318|T289318]]).
== 2025-06-04 == * 21:27 James_F: Zuul: [mediawiki/extensions/Springboard] Add basic CI, for [[phab:T395981|T395981]] * 12:10 lucaswerkmeister: lucaswerkmeister@deployment-deploy04:~$ mwscript createAndPromote commonswiki --interface-admin --force 'Lucas Werkmeister' # w-beta.wmflabs.org/mt == 2025-06-03 == * 23:59 James_F: Zuul: [mediawiki/services/<some>] Upgrade test suite to Node 24 & 22, for [[phab:T395926|T395926]] * 23:56 James_F: Zuul: [wikimedia/portals] Upgrade test suite to Node 24 and Node 22, for [[phab:T395926|T395926]] * 23:56 James_F: Zuul: [wikipeg] Upgrade test suite to Node 24 and Node 22, for [[phab:T395926|T395926]] * 23:55 James_F: Zuul: [oojs/*i] Upgrade test suite to Node 24 and Node 22, for [[phab:T395926|T395926]] * 23:53 James_F: Zuul: [wikimedia/portals/deploy] Drop tests, this repo isn't testable * 23:20 James_F: Zuul: Provide experimental Node 24 jobs where Node 22 ones exist, for [[phab:T395926|T395926]] * 17:09 bd808: Forced puppet run on deployment-webperf21 to pick up Kafka config changes ([[phab:T391273|T391273]]) * 17:08 bd808: Manually expanded (duplicated) jumbo-eqiad and main-eqiad aliases in kafka_clusters hiera config ([[phab:T391273|T391273]]) * 17:04 bd808: Added jumbo-eqiad and main-eqiad aliases to kafka_clusters hiera config ([[phab:T391273|T391273]]) * 16:00 James_F: Docker: Provide initial Node 24 images, for [[phab:T395923|T395923]] * 09:53 TheresNoTime: `samtar@deployment-cache-text08:~$ sudo service varnish-frontend restart` for [[phab:T395808|T395808]] * 09:52 TheresNoTime: `samtar@deployment-cache-text08:~$ sudo -i puppet agent -tv` for [[phab:T395808|T395808]] == 2025-06-02 == * 14:37 James_F: Zuul: Add Matrix to CI allowlist * 14:37 James_F: Zuul: [operations/software/gerrit/plugins/events-wikimedia] mark as archived, for [[phab:T304947|T304947]] * 14:36 James_F: Zuul: [mediawiki/extensions/CookieConsent] Add basic CI * 13:45 hashar: Updating Jenkins jobs for "drop obsolete creation of log & src dirs" {{!}} 
https://gerrit.wikimedia.org/r/c/integration/config/+/1152702 == 2025-05-30 == * 22:16 thcipriani: killed 1000s of zuul merger jobs via https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Very_high_queue_of_merger:merge_functions for parsoid, wikibase, and core * 21:20 bd808: Poked hole in blocked_nets for 188.214.8.0/21 ([[phab:T395709|T395709]]) * 09:43 Lucas_WMDE: ssh integration-castor05.integration.eqiad1.wikimedia.cloud sudo -u jenkins-deploy rm -rf /srv/castor/castor-mw-ext-and-skins/master/mwgate-node20 # fix failure seen in mwgate-node20 57273 and 57274 == 2025-05-29 == * 22:18 bd808: Submitted WikimediaDebug v3.1.0 to addons.mozilla.org for review ([[phab:T395190|T395190]], [[phab:T315111|T315111]]) * 22:12 bd808: Submitted WikimediaDebug v3.1.0 to Chrome Web Store for review ([[phab:T395190|T395190]], [[phab:T315111|T315111]]) == 2025-05-28 == * 20:27 James_F: Zuul: [mediawiki/extensions/ArticleSummaries] Promote to Wikimedia production, for [[phab:T393940|T393940]] * 13:15 James_F: [Beta Cluster] On deployment-deploy04, running DELETE FROM localuser WHERE lu_wiki='en_rtlwiki'; and DELETE FROM localnames WHERE ln_wiki='en_rtlwiki'; as part of closing the wiki * 12:30 James_F: Zuul: Add an explanatory note to bluespice template that we skip non-LTSes == 2025-05-24 == * 21:52 Krinkle: Disable publishing notifs on Phab tasks from extension-Chart mirror, [[phab:T143162|T143162]], [[phab:T272803|T272803]] == 2025-05-23 == * 18:36 James_F: Zuul: [mediawiki/core] Restore node testing for release branches, for [[phab:T395141|T395141]] * 17:55 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1149705 == 2025-05-22 == * 21:15 bd808: Forced Puppet run and restarted varnish-frontend on deployment-cache-upload08 to pick up new config ([[phab:T393404|T393404]]) * 21:12 bd808: Forced Puppet run and restarted varnish-frontend on deployment-cache-text08 to pick up new config ([[phab:T393404|T393404]]) * 21:09 bd808: Cherry-picked
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143602 ([[phab:T393404|T393404]]) * 21:09 bd808: Added `block_help: "see https://wikitech.wikimedia.org/wiki/Beta/Blocked_help for more information."` under `profile::cache::varnish::frontend::fe_vcl_config` in both deployment-cache-text and deployment-cache-upload Prefix Puppet ([[phab:T393404|T393404]]) * 20:11 brennen: devtools: phorge: test deploying work/merge-phorge-2024.35 changes * 17:25 bd808: `./jjb-update 'selenium-daily-beta*-MediaWiki'` to deploy updates to selenium-daily-beta-MediaWiki and selenium-daily-betacommons-MediaWiki failure notifications ([[phab:T394551|T394551]]) * 14:45 dancy: Upgrade gitlab-runner to v17.10.1 in gitlab-cloud-runner (staging and production) [[phab:T394953|T394953]] * 11:39 hashar: Triggered replication of mediawiki/extensions/BlueSpiceSmartlist and mediawiki/extensions/BlueSpiceSmartList to fix https://github.com/wikimedia/mediawiki-extensions-BlueSpiceSmartlist {{!}} [[phab:T394903|T394903]] * 11:37 hashar: gerrit: changed parent of mediawiki/extensions/BlueSpiceSmartlist (lower case L) to All-Archived-Projects to prevent it from being replicated to GitHub {{!}} [[phab:T394903|T394903]] == 2025-05-21 == * 07:24 hashar: restarted Gerrit on gerrit1003 * 07:18 hashar: restarted Jenkins on contint1002 == 2025-05-20 == * 17:51 bd808: Open CDN edge blocks to allow traffic from 190.217.20.32/28 * 17:13 dancy: Restarting Jenkins on contint1002 * 16:27 James_F: Docker: [quibble-bullseye-php81-coverage]: Fix clover-edit for py39 * 14:30 James_F: Docker: [quibble-bullseye-php74-coverage] Bump phpunit-patch-coverage to 0.0.15 * 14:28 hashar: integration: cleared Docker build cache on integration-agent-docker-1052 and integration-agent-docker-1061 * 13:49 James_F: Docker: Provide quibble-bullseye-php81-coverage == 2025-05-19 == * 15:48 James_F: Zuul: Switch primary master branch testing to PHP 8.1, not 7.4 * 15:45 James_F: Zuul: Switch / remove any experimental testing to PHP 8.1, 
not 7.4 * 15:39 James_F: Zuul: Switch REL1_39 branch testing to PHP 8.1, not 7.4 * 15:37 James_F: Zuul: Switch all wmf branch testing to PHP 8.1, not 7.4 * 13:25 James_F: Zuul: Simplify the regular Quibble job name to drop 'noselenium' * 13:24 James_F: jjb: Simplify the regular Quibble job name to drop 'noselenium' * 12:18 hashar: integration: cleaned Docker build cache on integration-agent-docker-1045 * 09:26 hashar: integration: cleaned Docker build cache on integration-agent-docker-1040 == 2025-05-16 == * 16:57 James_F: Zuul: Split Quibble jobs into selenium-only and non-selenium for skins == 2025-05-15 == * 21:22 bd808: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/1146722 * 13:54 James_F: Zuul: [mediawiki/extensions/Realnames] Use vendor quibble, not composer * 09:34 codders: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/1146520 == 2025-05-14 == * 21:31 bd808: Restarted varnish-frontend on deployment-cache-text08 to pick up blocked_nets changes ([[phab:T394311|T394311]]) * 16:06 hashar: Updating jobs for "jjb: silence some shell blocks in macro-docker.yaml" {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1145090 {{!}} [[phab:T393847|T393847]] * 13:43 hashar: Reloaded Zuul for Zuul: [mediawiki/extensions/Wikibase] Enable Open Search for apitests jobs {{!}} https://gerrit.wikimedia.org/r/1145331 {{!}} [[phab:T386691|T386691]] == 2025-05-13 == * 19:27 James_F: Zuul: Upgrade all Quibble 'apitests' jobs from 7.4 to 8.1, for [[phab:T386691|T386691]], [[phab:T328921|T328921]], [[phab:T328922|T328922]] * 12:35 dcausse: deployment-prep: reindexing wikidata to pickup the "mul" language field ([[phab:T392058|T392058]]) * 08:23 hashar: Update jobs to mute checks for npm packaging files {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1145087/ {{!}} [[phab:T393847|T393847]] == 2025-05-12 == * 16:48 hashar: Updated Jenkins jobs to silence git in ci-src-setup (take 2) {{!}}
https://gerrit.wikimedia.org/r/1144596 {{!}} [[phab:T393847|T393847]] * 16:46 bd808: Reenabled beta-scap-sync-world and beta-update-databases-eqiad Jenkins jobs * 15:55 hashar: Updated Jenkins jobs to silence git in ci-src-setup {{!}} https://gerrit.wikimedia.org/r/1144596 {{!}} [[phab:T393847|T393847]] * 15:50 bd808: `sudo /usr/local/sbin/clean-stale-puppet-certs --clean` on deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud. Attempting to fix a "Found non-revoked Puppet certificates for 1 deleted instances" Prometheus alert. * 15:28 bd808: Forced puppet run on deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud to fix Puppet run ([[phab:T393866|T393866]]) * 15:28 bd808: Forced puppet run on deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud to fix Puppet run ([[phab:T393866|T393866]]) * 15:22 bd808: Added `prometheus::instances` and `prometheus::instances_defaults` hiera settings to "deployment-etcd" Prefix Puppet via Horizon ([[phab:T393866|T393866]]) * 12:30 Krinkle: Disable publishing noise from rWSWF, [[phab:T143162|T143162]], [[phab:T267223|T267223]] * 09:52 hashar: Updating all jobs for https://gerrit.wikimedia.org/r/c/integration/config/+/1143972 "Omit noisy `ls` debugging commands when not needed" # [[phab:T282893|T282893]] & [[phab:T393847|T393847]] * 08:28 hashar: Disabled https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ due to a failure with Etcd/expired certificate # [[phab:T393855|T393855]] * 08:15 hashar: Updated jobs for "Replace all uses of `$(pwd)` with `$PWD`" {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1143967/ * 07:58 hashar: Disabled https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ due to a failure with Etcd/expired certificate # [[phab:T393855|T393855]] == 2025-05-08 == * 20:28 dancy: Updating buildkitd to v0.21.1 in gitlab-cloud-runners * 10:58 James_F: Zuul: Support capital first letter of e-mail for Aeywoo in allow list == 2025-05-07 == * 08:52 hashar: Updating 
Jenkins jobs to Quibble 1.14.1 * 07:03 hashar: Hard rebooted integration-agent-docker-1061 via Horizon, the instance is not reachable by ssh and looks bricked # [[phab:T393542|T393542]] * 06:58 hashar: Change ssh credentials for integration-agent-docker-1060 integration-agent-docker-1061 and integration-agent-docker-1062 to `key to connect to labs instances set up with role::ci::slave::labs::common` # [[phab:T393543|T393543]] * 06:57 hashar: Added label `blubber` and `pipelinelib` to integration-agent-docker-1060 integration-agent-docker-1061 and integration-agent-docker-1062 # [[phab:T393543|T393543]] * 06:41 hashar: integration: bring back integration-agent-docker-1062 , I had it disconnected on April 30 at 6:30am UTC to clean /srv/jenkins/workspace and apparently forgot to put it back online == 2025-05-06 == * 16:16 hashar: restarting CI Jenkins due to a deadlock affecting castor-save-workspace which ends up blocking jobs # [[phab:T353925|T353925]] * 15:06 hashar: Tag Quibble 1.4.1 @ {{Gerrit|5247438621f802ba9878970b3b34b2d67cefa54c}} == 2025-05-05 == * 14:32 hashar: contint1002 and contint2002: deleted /srv/docker/buildkit following the deletion of /srv/docker/overlay2 earlier today # [[phab:T393373|T393373]] * 13:50 hashar: contint1002 and contint2002: deleted /srv/docker/image/overlay2 following the deletion of /srv/docker/overlay2 earlier today # [[phab:T393373|T393373]] * 09:45 hashar: Cleared /srv/docker/overlay2 on contint2002 * 09:41 hashar: Cleared /srv/docker/overlay2 on contint1002 (it had bunch of old layers from April/May 2024) == 2025-05-04 == * 13:10 hashar: contint1002: deleted old videos from /srv/jenkins/builds * 08:59 James_F: Zuul: [AbuseFilter] Add CommunityConfiguration as a Phan dependency, for [[phab:T393240|T393240]] * 06:33 James_F: Zuul: [mediawiki/extensions/PageImages] Add Scribunto phan dependency, for [[phab:T131911|T131911]] * 06:33 James_F: Zuul: [mediawiki/extensions/WikimediaEvents] Add CLDR dependency == 2025-05-03 == * 10:28 
James_F: Zuul: [mediawiki/extensions/PageAssessments] Add Scribunto phan dependency, for [[phab:T380122|T380122]] == 2025-05-02 == * 17:39 James_F: Zuul: [mediawiki/extensions/WikimediaMessages] Add Echo as a phan dep * 12:30 James_F: Zuul: [mediawiki/extensions/CodeEditor] Add BetaFeatures phan dependency, for [[phab:T373711|T373711]] * 12:18 James_F: Zuul: [mediawiki/extensions/WikiLambda] Make Catalyst voting again * 08:43 hashar: Updating Quibble jobs to 1.14.0 {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1140215 {{!}} [[phab:T378797|T378797]] [[phab:T384927|T384927]] [[phab:T386691|T386691]] * 07:00 James_F: Zuul: [mediawiki/extensions/WikimediaMessages] Add cldr as full CI dep too, for [[phab:T391230|T391230]] * 06:52 James_F: Zuul: [mediawiki/extensions/WikimediaMessages] Add cldr as phan dependency, for [[phab:T391230|T391230]] == 2025-04-30 == * 23:46 dancy: Re-enabled https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ * 18:54 dancy: Disabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad while Gerrit is down. 
* 15:50 hashar: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1140203 * 15:01 hashar: Tagged Quibble 1.14.0 @ {{Gerrit|6d7c736d12daa7ea23b261ede02093f8fe7a83ae}} # [[phab:T378797|T378797]] [[phab:T384927|T384927]] [[phab:T386691|T386691]] * 06:30 hashar: integration: cleared /srv/jenkins/workspace on integration-agent-docker-1062 == 2025-04-29 == * 21:04 mutante: integration-agent-docker-1051.integration - killall -9 ffmpeg - [[phab:T392963|T392963]] * 20:28 mutante: integration-agent-docker-1048.integration - killall -9 ffmpeg - [[phab:T392963|T392963]] == 2025-04-28 == * 19:01 taavi: reloading zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1139536 * 15:49 dancy: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/76 * 13:05 James_F: Docker: Bump Node20 and Node22 binaries to latest and cascade == 2025-04-26 == * 00:05 bd808: Punched a hole in the beta cluster network blocks to allow 38.242.176.0/22 through. == 2025-04-24 == * 19:54 thcipriani: deployment-cache-text08: systemctl reload varnish-frontend following puppet run change to /etc/varnish/blocked-nets.inc.vcl * 19:49 thcipriani: deployment-cache-text08: sudo puppet-run to pick up https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/42c7880be27913c9e841642d9ff3e50deb455e08 * 15:32 bd808: Punched a hole in the beta cluster network blocks to allow 47.144.0.0/12 through.
([[phab:T392534|T392534]]) * 14:41 dancy: Updating runners to v17.9.3 in gitlab-cloud-runners (production) * 14:34 dancy: Updating runners to v17.9.3 in gitlab-cloud-runners (staging) == 2025-04-23 == * 22:59 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 22:43 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 22:15 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up a huge pile of new blocks ([[phab:T392534|T392534]]) * 22:11 James_F: Zuul: [mediawiki/services/parsoid/testreduce] Switch Node 20 CI on, for [[phab:T382177|T382177]] * 21:47 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 21:29 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 20:47 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 17:43 James_F: Zuul: [mediawiki/services/parsoid/testreduce] Disable CI for now, for [[phab:T382177|T382177]] * 16:57 brennen: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/commit/a80e5211100f1cc42e4ae020d4266ea22938eb5a ([[phab:T383097|T383097]]) * 14:25 James_F: Zuul: [wikimedia/portals] Switch to Node 20, for [[phab:T382179|T382179]] == 2025-04-17 == * 10:15 hashar: gerrit: reparented apps.git to All-Archived-Projects.git in order to BLOCK `mediawiki-replication`. 
I have also archived all subprojects # [[phab:T392198|T392198]] == 2025-04-16 == * 20:59 bd808: Blocked 193.43.72.0/24 and 14.160.0.0/11 because beta was very, very sad * 16:02 James_F: Zuul: [mediawiki/extensions/WikiLambda] Make Catalyst non-voting for now * 09:20 hashar: integration: restarted integration-puppetserver-01 == 2025-04-15 == * 22:02 James_F: Zuul: [mediawiki/extensions/WikiLambda] Make Catalyst job voting, for [[phab:T368002|T368002]] * 19:40 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392003|T392003]]) * 18:11 bd808: `bd808@deployment-cache-text08:~$ sudo service varnish-frontend restart` ([[phab:T392003|T392003]]) * 18:06 bd808: `sudo puppet agent -tv` on deployment-cache-text08 to update varnish deny list ([[phab:T392003|T392003]]) * 17:30 bd808: `shutdown -r now` on deployment-mediawiki14. Load has been growing for ~2 days. == 2025-04-11 == * 19:53 James_F: Zuul: [oojs/router] Mark as archived, for [[phab:T391709|T391709]] * 14:00 hashar: restarted integration-puppetserver: jvm went out of memory == 2025-04-10 == * 23:40 bd808: Removed wikifunctions from deployment-cache prefix puppet's profile::cache::haproxy::available_unified_certificates::server_names. https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/6af09ceaa6d261c910fb4b42d7b3e8b8172c8041%5E%21/ * 23:36 bd808: Deleted m.wikifunctions.beta.wmflabs.org, *.wikifunctions.beta.wmflabs.org, and wikifunctions.beta.wmflabs.org DNS records per [[Special:Diff/2292116]]. All 3 were pointing to 185.15.56.36. 
* 14:16 hashar: deployment-prep: `profile::mediawiki::php::increase_open_files: True` on https://horizon.wikimedia.org/project/prefixpuppet/?tab=prefix_puppet__puppet-deployment-mediawiki # [[phab:T389422|T389422]] * 14:03 James_F: [Beta Cluster] On deployment-deploy04, running DELETE FROM localuser WHERE lu_wiki='wikifunctionswiki'; and DELETE FROM localnames WHERE ln_wiki='wikifunctionswiki'; for [[phab:T391511|T391511]] == 2025-04-08 == * 22:39 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1135128 * 22:15 bd808: Manually deleted 'deployment-wikikube-v127' Magnum cluster template via Horizon. Deletion via OpenTofu has timed out repeatedly. * 22:08 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1135123 * 22:02 brennen: Updating docker-pkg files on contint primary for [[phab:T383065|T383065]] * 21:11 James_F: Beta Cluster: Shutting off deployment-docker-wikifunctions01, we're decom'ing it. * 20:44 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/c/integration/config/+/1135098 == 2025-04-07 == * 17:20 bd808: `service navtiming stop` to halt "Unhandled exception in main loop, restarting consumer" crash loop ([[phab:T391272|T391272]]) * 17:15 bd808: Reboot deployment-webperf21 ([[phab:T391272|T391272]]) * 16:58 bd808: `puppet agent -tv` to catch up with missed puppet runs on deployment-webperf21 ([[phab:T391272|T391272]]) * 16:56 bd808: `rm /var/log/user.log.1` on deployment-webperf21 ([[phab:T391272|T391272]]) * 16:47 bd808: `sudo /usr/local/sbin/clean-stale-puppet-certs --clean` on deployment-puppetserver-1 to clean up dangling certs for deployment-elastic<nowiki>{</nowiki>09,10,11<nowiki>}</nowiki> == 2025-04-04 == * 09:42 Lucas_WMDE: ssh integration-castor05.integration.eqiad1.wikimedia.cloud sudo -u jenkins-deploy rm -rf /srv/castor/castor-mw-ext-and-skins/master/mwgate-node20 # fix failure seen in mwgate-node20 35782 and 35784 * 09:09 hashar: Update
tox jobs to default to python 3.9 {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1134168 * 08:53 hashar: Updating Quibble jobs to catch up with latest image https://gerrit.wikimedia.org/r/c/integration/config/+/1134167 {{!}} [[phab:T3666646|T3666646]] * 00:35 thcipriani: integration-agent-docker-1041 marked offline due to /srv disk space * 00:09 Krinkle: Disable duplicate publishing noise from extension-MediaUploader, ref [[phab:T143162|T143162]], [[phab:T389450|T389450]] == 2025-04-03 == * 15:06 James_F: Zuul: Configure the REL1_44 test and gate pipelines, for [[phab:T390695|T390695]] * 13:33 James_F: Docker: [quibble-bullseye] Revert MariaDB to 10.5, for (against) [[phab:T366646|T366646]] * 13:08 James_F: Zuul: [mediawiki/extensions/MetricsPlatform] Publish JS docs == 2025-04-02 == * 13:39 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133383 [[phab:T390754|T390754]] * 12:36 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133379 https://gerrit.wikimedia.org/r/1133380 * 12:20 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133373 == 2025-04-01 == * 20:46 James_F: Zuul: Swap the branch check to specific release branches, for [[phab:T390754|T390754]] etc.
* 20:34 James_F: Docker: [quibble-bullseye] Switch MariaDB to 10.6 Wikimedia package, for [[phab:T366646|T366646]] * 20:26 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133238 * 20:09 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133231 * 19:31 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133221 [[phab:T390754|T390754]] * 18:40 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133209 [[phab:T390772|T390772]] * 16:53 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133184 [[phab:T390754|T390754]] == 2025-03-31 == * 18:26 dancy: Reloading Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1132688 * 15:20 James_F: Zuul: [mediawiki/extensions/EmailAuth] Mark as in Wikimedia production, move up, for [[phab:T390437|T390437]] * 11:08 dcausse: [[phab:T389971|T389971]]: deleting deployment-elastic* VMs in deployment-prep * 08:24 dcausse: [[phab:T389971|T389971]]: shutting down deployment-elastic* VMs in deployment-prep == 2025-03-28 == * 22:01 Krinkle: Disable duplicate publishing noise from extension-LoginNotify, ref [[phab:T143162|T143162]], [[phab:T390315|T390315]] * 21:39 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1130957 * 21:15 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1130957 == 2025-03-27 == * 16:28 bd808: Moved Puppet configuration from deployment-cache-text08 to deployment-cache-text prefix Puppet * 16:05 bd808: `sudo systemctl restart varnish-frontend` on deployment-cache-text08 ([[phab:T390209|T390209]]) * 15:05 bd808: Moved role::acme_chief::cloud from individual instance config to deployment-acme-chief Puppet prefix. 
* 00:55 bd808: Removed prefix puppet classes for deployment-acme-chief ([[phab:T390128|T390128]]) == 2025-03-26 == * 20:23 inflatador: bking@deployment-prep populating new OpenSearch cluster indices a la https://wikitech.wikimedia.org/w/index.php?title=Search&oldid=2164435#Adding_new_wikis [[phab:T389971|T389971]] * 17:10 inflatador: bking@deployment-prep reverted an accidental replacement of deployment-acme-chief.yaml [[phab:T389971|T389971]] * 15:04 dancy: Update gitlab-runners to v17.8.4 in gitlab-cloud-runners staging and production. * 00:30 bd808: Delete parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud service name again ([[phab:T389252|T389252]]) == 2025-03-25 == * 21:11 jeena: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1130722 * 04:18 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1130729 == 2025-03-24 == * 19:35 hashar: Updating Jenkins jobs for https://gerrit.wikimedia.org/r/c/integration/config/+/1130700 == 2025-03-23 == * 18:41 James_F: Zuul: Add 0xDeadbeef to CI allowlist * 18:34 James_F: Zuul: [operations/debs/bdsync] Mark as archived, for [[phab:T377882|T377882]] * 18:31 James_F: Zuul: [mediawiki/extensions/CheckUser] Add GrowthExperiments dependency, for [[phab:T386435|T386435]] * 18:29 James_F: Zuul: [mediawiki/extensions/CategoryWatch] Add Echo CI dependency == 2025-03-20 == * 23:31 bd808: integration: thcipriani added integration-agent-docker-106<nowiki>{</nowiki>0,1,2<nowiki>}</nowiki> earlier today ([[phab:T389554|T389554]]) * 22:50 brennen: integration: added jenkins nodes for integration-agent-docker-106<nowiki>{</nowiki>3,4,5<nowiki>}</nowiki> with 3 executors per each ([[phab:T389554|T389554]]) * 21:41 brennen: integration: launched integration-agent-docker-106<nowiki>{</nowiki>3,4,5<nowiki>}</nowiki> ([[phab:T389554|T389554]]) * 21:25 eileen: civicrm upgraded from {{Gerrit|7b532ad7}} to {{Gerrit|fba4c3d6}} * 15:13 dancy: Rebooting integration-agent-docker-1046 (Seems to be inaccessible 
since February) * 08:28 taavi: reloading zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1129765 == 2025-03-19 == * 20:32 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1129364 * 00:12 bd808: Trying the simplest thing that might work by adding a CNAME record for parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud. ([[phab:T389252|T389252]]) == 2025-03-18 == * 20:25 bd808: Rebooting deployment-jobrunner05 because things just seem weird ([[phab:T387631|T387631]], [[phab:T387276|T387276]]) * 15:18 sergi0: run CommunityUpdates config schema migration `foreachwikiindblist growthexperiments extensions/CommunityConfiguration/maintenance/migrateConfig.php CommunityUpdates` ([[phab:T387737|T387737]]) == 2025-03-14 == * 21:36 Reedy: deployed https://gerrit.wikimedia.org/r/1127982 * 16:55 Lucas_WMDE: manually killed job https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/2928/console which had been stuck since 16:33 UTC, blocking gate-and-submit :( == 2025-03-13 == * 21:29 dancy: Finished gitlab cloud runners k8s production cluster upgrade ([[phab:T388836|T388836]]) * 20:42 dancy: Finished gitlab cloud runners k8s staging cluster upgrade ([[phab:T388836|T388836]]) * 20:09 dancy: Starting gitlab cloud runners k8s production cluster upgrade ([[phab:T388836|T388836]]) * 19:26 dancy: Starting gitlab cloud runners k8s staging cluster upgrade ([[phab:T388836|T388836]]) == 2025-03-11 == * 22:54 bd808: Deleted unattached volumes: alert01, db09, deploy03, mwmaint, ores02, parsoid14-srv, prometheus05 * 22:39 bd808: Released unused floating IPs 185.15.56.9 and 185.15.56.97 back to global pool * 22:08 bd808: Updated mail.beta.wmflabs.org service name to point to 185.15.56.115 * 22:04 bd808: Deleted orphan parsoid-external-ci-access.beta.wmflabs.org. 
DNS record * 21:53 bd808: Deleted dangling prometheus-beta.wmcloud.org web proxy * 21:50 bd808: Deleted dangling w-beta.wmflabs.org web proxy * 21:42 bd808: Deleted unused "deployment-parsoid" Prefix Puppet configuration * 20:48 James_F: Docker: [quibble-bullseye-php81 & php81] Use PCRE2 backport from component/php81, for [[phab:T386006|T386006]] * 13:19 James_F: Zuul: [mediawiki/extensions/ActiveAbstract] Mark as archived, for [[phab:T382069|T382069]] * 03:54 eileen: civicrm upgraded from {{Gerrit|f2222fcd}} to {{Gerrit|ec20a105}} == 2025-03-10 == * 15:20 James_F: Zuul: [mediawiki/services/servicelib-node] Mark as archived, for [[phab:T388424|T388424]] * 13:47 hashar: gerrit: removed leftover empty directory `/srv/gerrit/plugins/lfs`. Data have been migrated to `/srv/gerrit/plugins/lfs` as part of moving Gerrit data out of `/`. See [[phab:T333143|T333143]] == 2025-03-08 == * 01:22 James_F: Zuul: [php-session-serializer] Enable PHP 8.4 as voting, for [[phab:T368270|T368270]] == 2025-03-07 == * 21:00 James_F: Zuul: [mediawiki/libs/Shellbox] Enable PHP 8.4 as voting, for [[phab:T386570|T386570]] * 20:53 James_F: Zuul: [wikipeg] Enable PHP 8.4 as voting, for [[phab:T386570|T386570]] * 20:07 James_F: Zuul: [mediawiki/libs/Equivset] Enable PHP 8.4 as voting, for [[phab:T387806|T387806]] == 2025-03-05 == * 00:21 dancy: Reenabled beta-scap-sync-world ([[phab:T166010|T166010]]) == 2025-03-04 == * 23:26 dancy: Disabling beta-scap-sync-world for noise reduction while dealing with [[phab:T166010|T166010]] * 22:10 James_F: Zuul: [mediawiki/services/example-node-api] Mark as archived, for [[phab:T387933|T387933]] * 01:42 James_F: Zuul: [mediawiki/tools/phan/SecurityCheckPlugin] Disable on PHP 8.4, for [[phab:T386570|T386570]] * 01:13 James_F: Zuul: Add WgevaertWikiBase to CI allowlist * 01:03 James_F: Zuul: Start testing in PHP 8.4 for 'mediawiki-php-library' repos, for [[phab:T386108|T386108]] == 2025-02-28 == * 18:20 dancy: Upgrading gitlab-runner to v17.7.1 in production 
gitlab-cloud-runners ([[phab:T386297|T386297]]) * 18:12 dancy: Upgrading gitlab-runner to v17.7.1 in staging gitlab-cloud-runners ([[phab:T386297|T386297]]) * 17:52 dancy: Upgraded scap to 4.138.0 in beta cluster * 16:43 bd808: Deleted now dangling parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud. DNS record ([[phab:T385849|T385849]]) * 16:40 bd808: Deleted deployment-parsoid14.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T385849|T385849]]) * 16:39 bd808: Deleted parsoid-external-ci-access.wmcloud.org proxy ([[phab:T385849|T385849]]) * 16:37 bd808: Deleted deployment-alert01.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T385849|T385849]]) * 16:36 bd808: Deleted deployment-bastion.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T385849|T385849]]) == 2025-02-27 == * 01:11 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1123063 [[phab:T386476|T386476]] == 2025-02-26 == * 20:21 James_F: jforrester@doc1003:~$ sudo -u doc-uploader rm -rf /srv/doc/cover-extensions/LdapAuthentication/ #[[phab:T376097|T376097]] * 20:18 James_F: Zuul: [mediawiki/extensions/LdapAuthentication] Mark as archived, for [[phab:T376097|T376097]] * 13:20 hashar: Updating Quibble jobs to 1.13.0. 
"Skip execution upon a success cache hit" which would make some jobs to skip tests entirely when a set of commits/image is known to have previously passed # [[phab:T383243|T383243]] {{!}} dduvall * 11:06 hashar: Tag Quibble 1.13.0 @ {{Gerrit|0ac128f7bc060c82f11317aabaf78a10b24aeeec}} # [[phab:T383243|T383243]] * 09:11 hashar: deployment-prep: cherry picking https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122901 "php: use component/pcre2 when using Php 8.1" to fix php # [[phab:T387276|T387276]] * 01:55 bd808: `./jjb-update 'integration-quibble-fullrun-*-php81' '*-php81-phan' '*php81*'` * 01:16 Reedy: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1122700 [[phab:T386006|T386006]] == 2025-02-25 == * 20:25 James_F: Docker: [php81] Update PHP to 8.1.31-1+wmf11u4, for [[phab:T386006|T386006]] * 14:07 James_F: Docker: [php81] Upgrade Wikimedia's PHP to 8.1.31-1+wmf11u3 & PCRE to 10.42 for [[phab:T386006|T386006]] == 2025-02-24 == * 01:02 jeena: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/73 == 2025-02-22 == * 11:27 taavi: rebooting integration-agent-docker-1047 which thinks it is gerrit == 2025-02-21 == * 22:54 brennen: gitlab: removing expiration date for 14 tokens expiring in 2025 ([[phab:T385930|T385930]]) * 22:36 brennen: gitlab: set require_personal_access_token_expiry and service_access_tokens_expiration_enforced to false == 2025-02-20 == * 20:15 dancy: Updated buildkitd to v0.20.0 in gitlab-cloud-runners ([[phab:T386955|T386955]]) * 20:15 dancy: Updated buildkitd to v0.20.0 in gitlab-cloud-runners == 2025-02-19 == * 21:28 dancy: Reenabled https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ ([[phab:T386851|T386851]]) * 19:35 dduvall: restarting jenkins to fix git related issues following java update ([[phab:T386755|T386755]]) * 15:47 dancy: Disabled the https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ job to reduce 
noise while the problem is being debugged. == 2025-02-18 == * 16:49 dancy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/1119815 * 16:11 James_F: Zuul: [operations/debs/dnsdist] Revert archival == 2025-02-13 == * 13:57 James_F: Zuul: [mediawiki/extensions/CirrusSearch] Drop WikibaseCirrusSearch dep, for [[phab:T386015|T386015]] == 2025-02-12 == * 17:22 James_F: Zuul: Add User:Michi j to CI allowlist * 17:21 James_F: Zuul: Add Dragoniez to CI allowlist == 2025-02-11 == * 15:43 James_F: Zuul: Make PHP 8.4 voting on lib repos where it already passes, for [[phab:T386108|T386108]] == 2025-02-10 == * 14:27 James_F: Zuul: Add Bunnypranav to CI allowlist == 2025-02-08 == * 00:07 bd808: Added `profile::maps::osm_master::disable_waterlines_import_timer: false` to deployment-maps prefix hiera ([[phab:T385921|T385921]]) == 2025-02-07 == * 22:14 brennen: phab/phorge: replaced mr-widget token in deployed config ([[phab:T385480|T385480]]) * 21:33 bd808: Added `profile::restbase::parsoid_uri: https://phabricator.wikimedia.org/T385902` to deployment-restbase prefix puppet ([[phab:T385902|T385902]]) * 01:34 bd808: Cherry-picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117997 to deployment-puppetmaster ([[phab:T385849|T385849]]) * 00:42 bd808: Shutoff deployment-parsoid14 to see if anything breaks/anyone yells ([[phab:T385849|T385849]]) == 2025-02-06 == * 23:53 bd808: Updated citoid-beta.wmflabs.org to point to deployment-docker-citoid02 * 23:50 bd808: Deleted beta-prometheus.wmflabs.org; it was pointed to an IP now owned by the mdwikioffline project. 
* 23:43 bd808: Deleted recently orphaned spiderpig.wmcloud.org proxy after discussion with dancy * 16:20 bd808: Rebooted deployment-sessionstore06 ([[phab:T385803|T385803]]) * 12:07 andrewbogott: rebooting all servers for [[phab:T385264|T385264]] == 2025-02-05 == * 19:17 James_F: Zuul: [mediawiki/extensions/DonationInterface] Switch CI from PHP74 to PHP82 * 18:23 James_F: Zuul: [mediawiki/extensions/cldr] Raise FR-special job to REL1_43 * 18:22 James_F: Zuul: [mediawiki/extensions/DonationInterface] Raise FR-special job to REL1_43 * 18:11 James_F: Zuul: [labs/tools/heritage] Fold template into this, only user * 18:08 James_F: Zuul: [mediawiki/extensions/FundraisingEmailUnsubscribe] Test in PHP 8.2+ only * 17:29 James_F: Zuul: [mediawiki/core] Test fundraising branches against PHP 8.2 * 17:19 James_F: Zuul: [mediawiki/extensions/FundraisingEmailUnsubscribe] Mark as non-prod == 2025-02-03 == * 12:34 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1115782 == 2025-01-30 == * 15:12 James_F: Zuul: [mediawiki/extensions/Wikibase] Only inject EntitySchema on 1.43+, for [[phab:T385175|T385175]] * 01:39 James_F: Zuul: [mediawiki/core] Remove composer variant from wmf branches * 00:42 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1115131 == 2025-01-29 == * 18:03 James_F: Zuul: Make FR REL1_43-php82 voting for cldr and FEU * 17:54 James_F: Zuul: Add FR REL1_43-php82 as experimental to other extensions * 17:40 James_F: Zuul: [mediawiki/extensions/cldr] Add FR REL1_43-php82 as experimental * 17:40 James_F: Zuul: [mediawiki/extensions/cldr] Re-enable FR-tech job as voting, passes fine * 16:57 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1115064 * 16:33 hashar: gerrit: marked all legacy Puppet modules as read-only ( https://gerrit.wikimedia.org/r/admin/repos/q/filter:operations/puppet/ ) and removed the associated GitHub mirrors that existed for some of them == 2025-01-28 == * 17:46 dancy: Reloading Zuul to deploy 
https://gerrit.wikimedia.org/r/c/integration/config/+/1113550 ([[phab:T383337|T383337]]) * 17:38 dancy: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/c/integration/config/+/1113549 ([[phab:T383337|T383337]]) * 10:07 hashar: Manually cleaned integration-agent-docker-1043 == 2025-01-27 == * 18:17 hashar: Cleaned disk on integration-agent-docker-1051 == 2025-01-25 == * 09:20 taavi: reloading zuul for https://gerrit.wikimedia.org/r/1113739 == 2025-01-24 == * 21:44 James_F: Revert "Zuul: Switch Fundraising jobs to REL1_43" == 2025-01-23 == * 16:31 dancy: Updating production gitlab-cloud-runners to v17.6.1 * 16:23 dancy: Updating staging gitlab-cloud-runners to v17.6.1 == 2025-01-22 == * 18:14 James_F: Zuul: [mediawiki/extensions/WikiLambda] Add Wikibase as a phan dependency == 2025-01-20 == * 09:55 hashar: Updating Quibble jobs to enable success cache experiment - [[phab:T383243|T383243]] * 08:20 hashar: Updating all Jenkins jobs to update Quibble to 1.12.0 == 2025-01-17 == * 16:59 dduvall: Building Docker images for Quibble 1.12.0 * 15:00 hashar: Building Docker images for Quibble 1.12.0 * 12:56 hashar: Tag Quibble 1.12.0 @ {{Gerrit|633099ead3ec72180e7890e1980074b4fde56c26}} # [[phab:T365978|T365978]], [[phab:T383243|T383243]] == 2025-01-14 == * 17:14 brennen: integration project: create integration-agent-docker-1059 for [[phab:T383254|T383254]] * 16:50 brennen: integration project: create integration-agent-docker-1058 for [[phab:T383254|T383254]] == 2025-01-10 == * 15:55 dancy: Updating gitlab-cloud-runners (prod) to v17.5.5 ([[phab:T383263|T383263]]) * 15:49 dancy: Updating gitlab-cloud-runners (staging) to v17.5.5 == 2025-01-09 == * 22:20 brennen: gitlab: Feature.enable(:kubernetes_agent_protected_branches) - https://docs.gitlab.com/ee/user/clusters/agent/ci_cd_workflow.html#restrict-access-to-the-agent-to-protected-branches * 18:08 James_F: Docker: [node22] Update Node to v22.13.0, & switch base image to bookworm, for 
[[phab:T383337|T383337]] * 17:01 James_F: Docker: [node20] Update Node to v20.18.1, & switch base image to bookworm, for [[phab:T383337|T383337]] * 15:13 James_F: Docker: [sury-php] Re-platform to bookworm == 2025-01-08 == * 22:07 hashar: castor: deleting potentially corrupted npm cache. On integration-castor05: sudo rm -fR /srv/castor/castor-mw-ext-and-skins/master/<nowiki>{</nowiki>wmf-quibble-selenium-php74,quibble-vendor-mysql-php74-selenium<nowiki>}</nowiki>/npm # [[phab:T383237|T383237]] == 2025-01-07 == * 22:07 hashar: Deleted /srv/zuul/git/operations/dumps/dcat on both contint1002 and contint2002 # [[phab:T157818|T157818]] * 19:00 bd808: `/usr/local/sbin/clean-stale-puppet-certs --clean` ([[phab:T383153|T383153]]) * 18:53 taavi: taavi@deployment-puppetserver-1:~$ sudo puppetserver ca clean --certname maps-master01.maps-experiments.eqiad1.wikimedia.cloud # [[phab:T383153|T383153]] * 18:50 taavi: taavi@deployment-puppetserver-1:~$ sudo puppet node clean geoshapes.maps-experiments.eqiad1.wikimedia.cloud # [[phab:T383153|T383153]] * 18:30 bd808@cloudcumin1001: END (FAIL) - Cookbook wmcs.vps.remove_instance (exit_code=1) for instance deployment-etcd04 * 18:30 bd808@cloudcumin1001: START - Cookbook wmcs.vps.remove_instance for instance deployment-etcd04 * 14:48 hashar: Manually renamed wikibase-daily-npm-audit-daily-node18-npmaudit to node20 variant and refresh the config with JJB * 14:33 James_F: Zuul: [mediawiki/extensions/WikiLambda] Only run standalone jobs in master == 2025-01-06 == * 20:16 andrewbogott: removed the (non-existent?) 
role::mw_rc_irc from puppet config for deployment-ircd03.deployment-prep.eqiad1.wikimedia.cloud * 19:35 bd808: Manually generated missing en_US.UTF-8 locale on deployment-maps-master02.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T361381|T361381]]) * 19:32 bd808: Added `postgresql::postgis::postgresql_postgis_package: postgresql-15-postgis-3` to deployment-maps Prefix Puppet to work around default parameter problem ([[phab:T361381|T361381]]) * 19:31 bd808: Issued new Puppet cert for deployment-maps-master02.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T361381|T361381]]) * 19:27 bd808: Added `postgresql::postgis::postgresql_postgis_package: ignored` to deployment-maps Prefix Puppet to work around default parameter problem ([[phab:T361381|T361381]]) * 19:15 brennen: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/71 ([[phab:T382709|T382709]]) * 19:11 bd808: Added placeholders for `graphite_host` and `statsd` to deployment-webperf Prefix Puppet * 18:53 bd808: Fixed missing profile::swift::global_account_keys::<nowiki>{</nowiki>codfw, eqiad<nowiki>}</nowiki> placeholders breaking deployment-ms-* puppet runs * 18:38 bd808: Fixed incorrect deployment-restbase prefix puppet setting that was causing puppet run failures * 18:19 bd808: Issued a new Puppet client cert for traindev01.deployment-prep.eqiad1.wikimedia.cloud * 14:58 James_F: Zuul: Drop CI for REL1_41 branch, now EOL per [[phab:T376550|T376550]] * 09:03 hashar: gerrit: flushed diff_intraline, diff_summary, gerrit_file_diff and git_file_diff caches after having turned on diff3 style # [[phab:T359821|T359821]] == 2025-01-02 == * 11:27 hashar: Reloaded Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1105679 # [[phab:T374113|T374113]] {{SAL-archives/Release Engineering}} <noinclude>[[Category:SAL]]</noinclude> qi3m1vrdlvonxd118i5m2tvrotao2cq 2309632 2309631 2025-06-08T18:12:50Z Stashbot 7414 James_F: Zuul: Fold 
extension-quibble-php81-or-later template into extension-quibble 2309632 wikitext text/x-wiki == 2025-06-08 == * 18:12 James_F: Zuul: Fold extension-quibble-php81-or-later template into extension-quibble * 18:04 James_F: Zuul: [mediawiki/extensions/SemanticVersion] Add basic CI == 2025-06-06 == * 14:37 jnuche: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/79 == 2025-06-05 == * 23:21 thcipriani: update scap in beta to 4.171.0 to match prod * 20:44 James_F: Zuul: [wikimedia-ui-base] Sunset WikimediaUI Base, archive repo's CI, for [[phab:T354310|T354310]] * 20:20 bd808: Added `profile::memcached::firewall_src_sets: ~` to deployment-memc prefix puppet ([[phab:T396109|T396109]]) * 19:03 Krinkle: Update profile::tlsproxy::envoy::cfssl_options under deployment-mediawiki in Horizon, to include the remaining wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary. ref [[phab:T289318|T289318]] * 18:26 James_F: Docker: Re-build PHP images with php-uuid (and incidentally bump versions), for [[phab:T373752|T373752]] * 17:14 James_F: Docker: [mediawiki-phan-testrun] Migrate parent image from php74 to php81 * 17:10 James_F: Docker: [phpmetrics] Migrate parent image from php74 to php81 * 17:10 James_F: Where will Abstract Content go? 
* 17:07 James_F: Zuul: [mediawiki/extensions/WikimediaMaintenance] Add dependencies, for [[phab:T58074|T58074]] * 16:39 James_F: Zuul: [mediawiki/tools/phan/PerfCheckPlugin] Use a template for CI * 16:37 James_F: Zuul: [mediawiki/tools/phan/SecurityCheckPlugin] Stop testing in PHP 7.4 * 16:36 James_F: Zuul: [labs/tools/heritage] Raise PHP testing from 7.4 to 8.1 * 16:34 James_F: Zuul: Stop testing most libraries and tools in PHP 7.4 * 16:28 James_F: Zuul: Stop testing PHP extensions with PHP 7.4 * 16:26 James_F: Zuul: [integration/quibble] Stop testing in PHP 7.4, for [[phab:T328921|T328921]] and [[phab:T328922|T328922]] * 16:23 James_F: Zuul: [mediawiki/services/parsoid] Stop testing in PHP 7.4 * 16:21 James_F: Zuul: [operations/mediawiki-config] Stop testing in PHP 7.4 * 16:09 James_F: Zuul: Drop all PHP 7.4 testing for MediaWiki things, for [[phab:T328921|T328921]] and [[phab:T328922|T328922]] * 04:46 Krinkle: gitpuppet@deployment-puppetserver-1:/srv/git/operations/puppet$ Cherry-pick https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153764, ref [[phab:T289318|T289318]] * 03:58 Krinkle: Update profile::cache::haproxy::available_unified_certificates under deployment-cache in Horizon, to include the remaining wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary. Remove `*.zero.wikipedia.beta.wmflabs.org` which wasn't responding/didn't work anymore. 
ref [[phab:T289318|T289318]] * 03:34 Krinkle: Update profile::acme_chief::certificates under deployment-acme-chief prefix in Horizon, to include the remaining wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary (wikipedia and wikivoyage were already there), ref [[phab:T289318|T289318]] * 03:34 Krinkle: Update profile::acme_chief::certificates under deployment-acme-chief prefix in Horizon, to include the remaining wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary (wikipedia and wikivoyage were already there) * 00:32 Krinkle: Add `TXT *.wikimedia.beta.wmcloud.org. "v=spf1 -all"` to match beta.wmflabs.org DNS (ref [[phab:T289318|T289318]], changing email is out of scope for now, but might as well add the DNS records). * 00:22 Krinkle: Adding missing DNS entries under beta.wmcloud.org. There was already: *.wikipedia, *.m.wikimedia, *.wikivoyage, *.m.wikivoyage (for [[phab:T355281|T355281]]). Adding: wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary, wikidata, upload ([[phab:T289318|T289318]]). 
== 2025-06-04 == * 21:27 James_F: Zuul: [mediawiki/extensions/Springboard] Add basic CI, for [[phab:T395981|T395981]] * 12:10 lucaswerkmeister: lucaswerkmeister@deployment-deploy04:~$ mwscript createAndPromote commonswiki --interface-admin --force 'Lucas Werkmeister' # w-beta.wmflabs.org/mt == 2025-06-03 == * 23:59 James_F: Zuul: [mediawiki/services/<some>] Upgrade test suite to Node 24 & 22, for [[phab:T395926|T395926]] * 23:56 James_F: Zuul: [wikimedia/portals] Upgrade test suite to Node 24 and Node 22, for [[phab:T395926|T395926]] * 23:56 James_F: Zuul: [wikipeg] Upgrade test suite to Node 24 and Node 22, for [[phab:T395926|T395926]] * 23:55 James_F: Zuul: [oojs/*i] Upgrade test suite to Node 24 and Node 22, for [[phab:T395926|T395926]] * 23:53 James_F: Zuul: [wikimedia/portals/deploy] Drop tests, this repo isn't testable * 23:20 James_F: Zuul: Provide experimental Node 24 jobs where Node 22 ones exist, for [[phab:T395926|T395926]] * 17:09 bd808: Forced puppet run on deployment-webperf21 to pick up Kafka config changes ([[phab:T391273|T391273]]) * 17:08 bd808: Manually expanded (duplicated) jumbo-eqiad and main-eqiad aliases in kafka_clusters hiera config ([[phab:T391273|T391273]]) * 17:04 bd808: Added jumbo-eqiad and main-eqiad aliases to kafka_clusters hiera config ([[phab:T391273|T391273]]) * 16:00 James_F: Docker: Provide initial Node 24 images, for [[phab:T395923|T395923]] * 09:53 TheresNoTime: `samtar@deployment-cache-text08:~$ sudo service varnish-frontend restart` for [[phab:T395808|T395808]] * 09:52 TheresNoTime: `samtar@deployment-cache-text08:~$ sudo -i puppet agent -tv` for [[phab:T395808|T395808]] == 2025-06-02 == * 14:37 James_F: Zuul: Add Matrix to CI allowlist * 14:37 James_F: Zuul: [operations/software/gerrit/plugins/events-wikimedia] mark as archived, for [[phab:T304947|T304947]] * 14:36 James_F: Zuul: [mediawiki/extensions/CookieConsent] Add basic CI * 13:45 hashar: Updating Jenkins jobs for "drop obsolete creation of log & src dirs" {{!}} 
https://gerrit.wikimedia.org/r/c/integration/config/+/1152702 == 2025-05-30 == * 22:16 thcipriani: killed 1000s of zuul merger jobs via https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Very_high_queue_of_merger:merge_functions for parsoid, wikibase, and core * 21:20 bd808: Poked hole in blocked_nets for 188.214.8.0/21 ([[phab:T395709|T395709]]) * 09:43 Lucas_WMDE: ssh integration-castor05.integration.eqiad1.wikimedia.cloud sudo -u jenkins-deploy rm -rf /srv/castor/castor-mw-ext-and-skins/master/mwgate-node20 # fix failure seen in mwgate-node20 57273 and 57274 == 2025-05-29 == * 22:18 bd808: Submitted WikimediaDebug v3.1.0 to addons.mozilla.org for review ([[phab:T395190|T395190]], [[phab:T315111|T315111]]) * 22:12 bd808: Submitted WikimediaDebug v3.1.0 to Chrome Web Store for review ([[phab:T395190|T395190]], [[phab:T315111|T315111]]) == 2025-05-28 == * 20:27 James_F: Zuul: [mediawiki/extensions/ArticleSummaries] Promote to Wikimedia production, for [[phab:T393940|T393940]] * 13:15 James_F: [Beta Cluster] On deployment-deploy04, running DELETE FROM localuser WHERE lu_wiki='en_rtlwiki'; and DELETE FROM localnames WHERE ln_wiki='en_rtlwiki'; as part of closing the wiki * 12:30 James_F: Zuul: Add an explanatory note to bluespice template that we skip non-LTSes == 2025-05-24 == * 21:52 Krinkle: Disable publishing notifs on Phab tasks from extension-Chart mirror, [[phab:T143162|T143162]], [[phab:T272803|T272803]] == 2025-05-23 == * 18:36 James_F: Zuul: [mediawiki/core] Restore node testing for release branches, for [[phab:T395141|T395141]] * 17:55 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1149705 == 2025-05-22 == * 21:15 bd808: Forced Puppet run and restarted varnish-frontend on deployment-cache-upload08 to pick up new config ([[phab:T393404|T393404]]) * 21:12 bd808: Forced Puppet run and restarted varnish-frontend on deployment-cache-text08 to pick up new config ([[phab:T393404|T393404]]) * 21:09 bd808: Cherry-picked 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143602 ([[phab:T393404|T393404]]) * 21:09 bd808: Added `block_help: "see https://wikitech.wikimedia.org/wiki/Beta/Blocked_help for more information."` under `profile::cache::varnish::frontend::fe_vcl_config` in both deployment-cache-text and deployment-cache-upload Prefix Puppet ([[phab:T393404|T393404]]) * 20:11 brennen: devtools: phorge: test deploying work/merge-phorge-2024.35 changes * 17:25 bd808: `./jjb-update 'selenium-daily-beta*-MediaWiki'` to deploy updates to selenium-daily-beta-MediaWiki and selenium-daily-betacommons-MediaWiki failure notifications ([[phab:T394551|T394551]]) * 14:45 dancy: Upgrade gitlab-runner to v17.10.1 in gitlab-cloud-runner (staging and production) [[phab:T394953|T394953]] * 11:39 hashar: Triggered replication of mediawiki/extensions/BlueSpiceSmartlist and mediawiki/extensions/BlueSpiceSmartList to fix https://github.com/wikimedia/mediawiki-extensions-BlueSpiceSmartlist {{!}} [[phab:T394903|T394903]] * 11:37 hashar: gerrit: changed parent of mediawiki/extensions/BlueSpiceSmartlist (lower case L) to All-Archived-Projects to prevent it from being replicated to GitHub {{!}} [[phab:T394903|T394903]] == 2025-05-21 == * 07:24 hashar: restarted Gerrit on gerrit1003 * 07:18 hashar: restarted Jenkins on contint1002 == 2025-05-20 == * 17:51 bd808: Open CDN edge blocks to allow traffic from 190.217.20.32/28 * 17:13 dancy: Restarting Jenkins on contint1002 * 16:27 James_F: Docker: [quibble-bullseye-php81-coverage]: Fix clover-edit for py39 * 14:30 James_F: Docker: [quibble-bullseye-php74-coverage] Bump phpunit-patch-coverage to 0.0.15 * 14:28 hashar: integration: cleared Docker build cache on integration-agent-docker-1052 and integration-agent-docker-1061 * 13:49 James_F: Docker: Provide quibble-bullseye-php81-coverage == 2025-05-19 == * 15:48 James_F: Zuul: Switch primary master branch testing to PHP 8.1, not 7.4 * 15:45 James_F: Zuul: Switch / remove any experimental testing to PHP 8.1, 
not 7.4 * 15:39 James_F: Zuul: Switch REL1_39 branch testing to PHP 8.1, not 7.4 * 15:37 James_F: Zuul: Switch all wmf branch testing to PHP 8.1, not 7.4 * 13:25 James_F: Zuul: Simplify the regular Quibble job name to drop 'noselenium' * 13:24 James_F: jjb: Simplify the regular Quibble job name to drop 'noselenium' * 12:18 hashar: integration: cleaned Docker build cache on integration-agent-docker-1045 * 09:26 hashar: integration: cleaned Docker build cache on integration-agent-docker-1040 == 2025-05-16 == * 16:57 James_F: Zuul: Split Quibble jobs into selenium-only and non-selenium for skins == 2025-05-15 == * 21:22 bd808: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/1146722 * 13:54 James_F: Zuul: [mediawiki/extensions/Realnames] Use vendor quibble, not composer * 09:34 codders: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/1146520 == 2025-05-14 == * 21:31 bd808: Restarted varnish-frontend on deployment-cache-text08 to pick up blocked_nets changes ([[phab:T394311|T394311]]) * 16:06 hashar: Updating jobs for "jjb: silence some shell blocks in macro-docker.yaml" {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1145090 {{!}} [[phab:T393847|T393847]] * 13:43 hashar: Reloaded Zuul for Zuul: [mediawiki/extensions/Wikibase] Enable Open Search for apitests jobs {{!}} https://gerrit.wikimedia.org/r/1145331 {{!}} [[phab:T386691|T386691]] == 2025-05-13 == * 19:27 James_F: Zuul: Upgrade all Quibble 'apitests' jobs from 7.4 to 8.1, for [[phab:T386691|T386691]], [[phab:T328921|T328921]], [[phab:T328922|T328922]] * 12:35 dcausse: deployment-prep: reindexing wikidata to pickup the "mul" language field ([[phab:T392058|T392058]]) * 08:23 hashar: Update jobs to mute checks for npm packaging files {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1145087/ {{!}} [[phab:T393847|T393847]] == 2025-05-12 == * 16:48 hashar: Updated Jenkins jobs to silence git in ci-src-setup (take 2) {{!}} 
https://gerrit.wikimedia.org/r/1144596 {{!}} [[phab:T393847|T393847]] * 16:46 bd808: Reenabled beta-scap-sync-world and beta-update-databases-eqiad Jenkins jobs * 15:55 hashar: Updated Jenkins jobs to silence git in ci-src-setup {{!}} https://gerrit.wikimedia.org/r/1144596 {{!}} [[phab:T393847|T393847]] * 15:50 bd808: `sudo /usr/local/sbin/clean-stale-puppet-certs --clean` on deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud. Attempting to fix a "Found non-revoked Puppet certificates for 1 deleted instances" Prometheus alert. * 15:28 bd808: Forced puppet run on deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud to fix Puppet run ([[phab:T393866|T393866]]) * 15:28 bd808: Forced puppet run on deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud to fix Puppet run ([[phab:T393866|T393866]]) * 15:22 bd808: Added `prometheus::instances` and `prometheus::instances_defaults` hiera settings to "deployment-etcd" Prefix Puppet via Horizon ([[phab:T393866|T393866]]) * 12:30 Krinkle: Disable publishing noise from rWSWF, [[phab:T143162|T143162]], [[phab:T267223|T267223]] * 09:52 hashar: Updating all jobs for https://gerrit.wikimedia.org/r/c/integration/config/+/1143972 "Omit noisy `ls` debugging commands when not needed" # [[phab:T282893|T282893]] & [[phab:T393847|T393847]] * 08:28 hashar: Disabled https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ due to a failure with Etcd/expired certificate # [[phab:T393855|T393855]] * 08:15 hashar: Updated jobs for "Replace all uses of `$(pwd)` with `$PWD`" {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1143967/ * 07:58 hashar: Disabled https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ due to a failure with Etcd/expired certificate # [[phab:T393855|T393855]] == 2025-05-08 == * 20:28 dancy: Updating buildkitd to v0.21.1 in gitlab-cloud-runners * 10:58 James_F: Zuul: Support capital first letter of e-mail for Aeywoo in allow list == 2025-05-07 == * 08:52 hashar: Updating 
Jenkins jobs to Quibble 1.14.1 * 07:03 hashar: Hard rebooted integration-agent-docker-1061 via Horizon, the instance is not reachable by ssh and looks bricked # [[phab:T393542|T393542]] * 06:58 hashar: Change ssh credentials for integration-agent-docker-1060 integration-agent-docker-1061 and integration-agent-docker-1062 to `key to connect to labs instances set up with role::ci::slave::labs::common` # [[phab:T393543|T393543]] * 06:57 hashar: Added label `blubber` and `pipelinelib` to integration-agent-docker-1060 integration-agent-docker-1061 and integration-agent-docker-1062 # [[phab:T393543|T393543]] * 06:41 hashar: integration: bring back integration-agent-docker-1062 , I had it disconnected on April 30 at 6:30am UTC to clean /srv/jenkins/workspace and apparently forgot to put it back online == 2025-05-06 == * 16:16 hashar: restarting CI Jenkins due to a deadlock affecting castor-save-workspace which ends up blocking jobs # [[phab:T353925|T353925]] * 15:06 hashar: Tag Quibble 1.4.1 @ {{Gerrit|5247438621f802ba9878970b3b34b2d67cefa54c}} == 2025-05-05 == * 14:32 hashar: contint1002 and contint2002: deleted /srv/docker/buildkit following the deletion of /srv/docker/overlay2 earlier today # [[phab:T393373|T393373]] * 13:50 hashar: contint1002 and contint2002: deleted /srv/docker/image/overlay2 following the deletion of /srv/docker/overlay2 earlier today # [[phab:T393373|T393373]] * 09:45 hashar: Cleared /srv/docker/overlay2 on contint2002 * 09:41 hashar: Cleared /srv/docker/overlay2 on contint1002 (it had bunch of old layers from April/May 2024) == 2025-05-04 == * 13:10 hashar: contint1002: deleted old videos from /srv/jenkins/builds * 08:59 James_F: Zuul: [AbuseFilter] Add CommunityConfiguration as a Phan dependency, for [[phab:T393240|T393240]] * 06:33 James_F: Zuul: [mediawiki/extensions/PageImages] Add Scribunto phan dependency, for [[phab:T131911|T131911]] * 06:33 James_F: Zuul: [mediawiki/extensions/WikimediaEvents] Add CLDR dependency == 2025-05-03 == * 10:28 
James_F: Zuul: [mediawiki/extensions/PageAssessments] Add Scribunto phan dependency, for [[phab:T380122|T380122]] == 2025-05-02 == * 17:39 James_F: Zuul: [mediawiki/extensions/WikimediaMessages] Add Echo as a phan dep * 12:30 James_F: Zuul: [mediawiki/extensions/CodeEditor] Add BetaFeatures phan dependency, for [[phab:T373711|T373711]] * 12:18 James_F: Zuul: [mediawiki/extensions/WikiLambda] Make Catalyst voting again * 08:43 hashar: Updating Quibble jobs to 1.14.0 {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1140215 {{!}} [[phab:T378797|T378797]] [[phab:T384927|T384927]] [[phab:T386691|T386691]] * 07:00 James_F: Zuul: [mediawiki/extensions/WikimediaMessages] Add cldr as full CI dep too, for [[phab:T391230|T391230]] * 06:52 James_F: Zuul: [mediawiki/extensions/WikimediaMessages] Add cldr as phan dependency, for [[phab:T391230|T391230]] == 2025-04-30 == * 23:46 dancy: Re-enabled https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ * 18:54 dancy: Disabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad while Gerrit is down. 
* 15:50 hashar: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1140203 * 15:01 hashar: Tagged Quibble 1.14.0 @ {{Gerrit|6d7c736d12daa7ea23b261ede02093f8fe7a83ae}} # [[phab:T378797|T378797]] [[phab:T384927|T384927]] [[phab:T386691|T386691]] * 06:30 hashar: integration: cleared /srv/jenkins/workspace on integration-agent-docker-1062 == 2025-04-29 == * 21:04 mutante: integration-agent-docker-1051.integration - killall -9 ffmpeg - [[phab:T392963|T392963]] * 20:28 mutante: integration-agent-docker-1048.integration - killall -9 ffpmeg - [[phab:T392963|T392963]] == 2025-04-28 == * 19:01 taavi: reloading zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1139536 * 15:49 dancy: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/76 * 13:05 James_F: Docker: Bump Node20 and Node22 binaries to latest and cascade == 2025-04-26 == * 00:05 bd808: Punched a hole in the beta cluster network blocks to allow 38.242.176.0/22 through. == 2025-04-24 == * 19:54 thcipriani: deployment-cache-text08: systemctl reload varnish-frontend following puppet run change to /etc/varnish/blocked-nets.inc.vcl * 19:49 thcipriani: deployment-cache-text08: sudo puppet-run to pick up https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/42c7880be27913c9e841642d9ff3e50deb455e08 * 15:32 bd808: Punched a hole in the beta cluster network blocks to allow 47.144.0.0/12 through. 
([[phab:T392534|T392534]]) * 14:41 dancy: Updating runners to v17.9.3 in gitlab-cloud-runners (production) * 14:34 dancy: Updating runners to v17.9.3 in gitlab-cloud-runners (staging) == 2025-04-23 == * 22:59 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 22:43 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 22:15 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up a huge pile of new blocks ([[phab:T392534|T392534]]) * 22:11 James_F: Zuul: [mediawiki/services/parsoid/testreduce] Switch Node 20 CI on, for [[phab:T382177|T382177]] * 21:47 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 21:29 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 20:47 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 17:43 James_F: Zuul: [mediawiki/services/parsoid/testreduce] Disable CI for now, for [[phab:T382177|T382177]] * 16:57 brennen: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/commit/a80e5211100f1cc42e4ae020d4266ea22938eb5a ([[phab:T383097|T383097]]) * 14:25 James_F: Zuul: [wikimedia/portals] Switch to Node 20, for [[phab:T382179|T382179]] == 2025-04-17 == * 10:15 hashar: gerrit: reparented apps.git to All-Archived-Projects.git in order to BLOCK `mediawiki-replication`. 
I have also archived all subprojects # [[phab:T392198|T392198]] == 2025-04-16 == * 20:59 bd808: Blocked 193.43.72.0/24 and 14.160.0.0/11 because beta was very, very sad * 16:02 James_F: Zuul: [mediawiki/extensions/WikiLambda] Make Catalyst non-voting for now * 09:20 hashar: integration: restarted integration-puppetserver-01 == 2025-04-15 == * 22:02 James_F: Zuul: [mediawiki/extensions/WikiLambda] Make Catalyst job voting, for [[phab:T368002|T368002]] * 19:40 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392003|T392003]]) * 18:11 bd808: `bd808@deployment-cache-text08:~$ sudo service varnish-frontend restart` ([[phab:T392003|T392003]]) * 18:06 bd808: `sudo puppet agent -tv` on deployment-cache-text08 to update varnish deny list ([[phab:T392003|T392003]]) * 17:30 bd808: `shutdown -r now` on deployment-mediawiki14. Load has been growing for ~2 days. == 2025-04-11 == * 19:53 James_F: Zuul: [oojs/router] Mark as archived, for [[phab:T391709|T391709]] * 14:00 hashar: restarted integration-puppetserver: jvm went out of memory == 2025-04-10 == * 23:40 bd808: Removed wikifunctions from deployment-cache prefix puppet's profile::cache::haproxy::available_unified_certificates::server_names. https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/6af09ceaa6d261c910fb4b42d7b3e8b8172c8041%5E%21/ * 23:36 bd808: Deleted m.wikifunctions.beta.wmflabs.org, *.wikifunctions.beta.wmflabs.org, and wikifunctions.beta.wmflabs.org DNS records per [[Special:Diff/2292116]]. All 3 were pointing to 185.15.56.36. 
* 14:16 hashar: deployment-prep: `profile::mediawiki::php::increase_open_files: True` on https://horizon.wikimedia.org/project/prefixpuppet/?tab=prefix_puppet__puppet-deployment-mediawiki # [[phab:T389422|T389422]] * 14:03 James_F: [Beta Cluster] On deployment-deploy04, running DELETE FROM localuser WHERE lu_wiki='wikifunctionswiki'; and DELETE FROM localnames WHERE ln_wiki='wikifunctionswiki'; for [[phab:T391511|T391511]] == 2025-04-08 == * 22:39 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1135128 * 22:15 bd808: Manually deleted 'deployment-wikikube-v127' Magnum cluster template via Horizon. Deletion via OpenTofu has timed out repeatedly. * 22:08 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1135123 * 22:02 brennen: Updating docker-pkg files on contint primary for [[phab:T383065|T383065]] * 21:11 James_F: Beta Cluster: Shutting off deployment-docker-wikifunctions01; we're decom'ing it. * 20:44 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/c/integration/config/+/1135098 == 2025-04-07 == * 17:20 bd808: `service navtiming stop` to halt "Unhandled exception in main loop, restarting consumer" crash loop ([[phab:T391272|T391272]]) * 17:15 bd808: Reboot deployment-webperf21 ([[phab:T391272|T391272]]) * 16:58 bd808: `puppet agent -tv` to catch up with missed puppet runs on deployment-webperf21 ([[phab:T391272|T391272]]) * 16:56 bd808: `rm /var/log/user.log.1` on deployment-webperf21 ([[phab:T391272|T391272]]) * 16:47 bd808: `sudo /usr/local/sbin/clean-stale-puppet-certs --clean` on deployment-puppetserver-1 to clean up dangling certs for deployment-elastic<nowiki>{</nowiki>09,10,11<nowiki>}</nowiki> == 2025-04-04 == * 09:42 Lucas_WMDE: ssh integration-castor05.integration.eqiad1.wikimedia.cloud sudo -u jenkins-deploy rm -rf /srv/castor/castor-mw-ext-and-skins/master/mwgate-node20 # fix failure seen in mwgate-node20 35782 and 35784 * 09:09 hashar: Update 
tox jobs to default to python 3.9 {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1134168 * 08:53 hashar: Updating Quibble jobs to catch up with latest image https://gerrit.wikimedia.org/r/c/integration/config/+/1134167 {{!}} [[phab:T3666646|T3666646]] * 00:35 thcipriani: integration-agent-docker-1041 marked offline due to /srv disk space * 00:09 Krinkle: Disable duplicate publishing noise from extension-MediaUploader, ref [[phab:T143162|T143162]], [[phab:T389450|T389450]] == 2025-04-03 == * 15:06 James_F: Zuul: Configure the REL1_44 test and gate pipelines, for [[phab:T390695|T390695]] * 13:33 James_F: Docker: [quibble-bullseye] Revert MariaDB to 10.5, for (against) [[phab:T366646|T366646]] * 13:08 James_F: Zuul: [mediawiki/extensions/MetricsPlatform] Publish JS docs == 2025-04-02 == * 13:39 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133383 [[phab:T390754|T390754]] * 12:36 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133379 https://gerrit.wikimedia.org/r/1133380 * 12:20 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133373 == 2025-04-01 == * 20:46 James_F: Zuul: Swap the branch check to specific release branches, for [[phab:T390754|T390754]] etc. 
* 20:34 James_F: Docker: [quibble-bullseye] Switch MariaDB to 10.6 Wikimedia package, for [[phab:T366646|T366646]] * 20:26 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133238 * 20:09 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133231 * 19:31 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133221 [[phab:T390754|T390754]] * 18:40 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133209 [[phab:T390772|T390772]] * 16:53 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133184 [[phab:T390754|T390754]] == 2025-03-31 == * 18:26 dancy: Reloading Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1132688 * 15:20 James_F: Zuul: [mediawiki/extensions/EmailAuth] Mark as in Wikimedia production, move up, for [[phab:T390437|T390437]] * 11:08 dcausse: [[phab:T389971|T389971]]: deleting deployment-elastic* VMs in deployment-prep * 08:24 dcausse: [[phab:T389971|T389971]]: shutting down deployment-elastic* VMs in deployment-prep == 2025-03-28 == * 22:01 Krinkle: Disable duplicate publishing noise from extension-LoginNotify, ref [[phab:T143162|T143162]], [[phab:T390315|T390315]] * 21:39 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1130957 * 21:15 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1130957 == 2025-03-27 == * 16:28 bd808: Moved Puppet configuration from deployment-cache-text08 to deployment-cache-text prefix Puppet * 16:05 bd808: `sudo systemctl restart varnish-frontend` on deployment-cache-text08 ([[phab:T390209|T390209]]) * 15:05 bd808: Moved role::acme_chief::cloud from individual instance config to deployment-acme-chief Puppet prefix. 
* 00:55 bd808: Removed prefix puppet classes for deployment-acme-chief ([[phab:T390128|T390128]]) == 2025-03-26 == * 20:23 inflatador: bking@deployment-prep populating new OpenSearch cluster indices a la https://wikitech.wikimedia.org/w/index.php?title=Search&oldid=2164435#Adding_new_wikis [[phab:T389971|T389971]] * 17:10 inflatador: bking@deployment-prep reverted an accidental replacement of deployment-acme-chief.yaml [[phab:T389971|T389971]] * 15:04 dancy: Update gitlab-runners to v17.8.4 in gitlab-cloud-runners staging and production. * 00:30 bd808: Delete parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud service name again ([[phab:T389252|T389252]]) == 2025-03-25 == * 21:11 jeena: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1130722 * 04:18 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1130729 == 2025-03-24 == * 19:35 hashar: Updating Jenkins jobs for https://gerrit.wikimedia.org/r/c/integration/config/+/1130700 == 2025-03-23 == * 18:41 James_F: Zuul: Add 0xDeadbeef to CI allowlist * 18:34 James_F: Zuul: [operations/debs/bdsync] Mark as archived, for [[phab:T377882|T377882]] * 18:31 James_F: Zuul: [mediawiki/extensions/CheckUser] Add GrowthExperiments dependency, for [[phab:T386435|T386435]] * 18:29 James_F: Zuul: [mediawiki/extensions/CategoryWatch] Add Echo CI dependency == 2025-03-20 == * 23:31 bd808: integration: thcipriani added integration-agent-docker-106<nowiki>{</nowiki>0,1,2<nowiki>}</nowiki> earlier today ([[phab:T389554|T389554]]) * 22:50 brennen: integration: added jenkins nodes for integration-agent-docker-106<nowiki>{</nowiki>3,4,5<nowiki>}</nowiki> with 3 executors each ([[phab:T389554|T389554]]) * 21:41 brennen: integration: launched integration-agent-docker-106<nowiki>{</nowiki>3,4,5<nowiki>}</nowiki> ([[phab:T389554|T389554]]) * 21:25 eileen: civicrm upgraded from {{Gerrit|7b532ad7}} to {{Gerrit|fba4c3d6}} * 15:13 dancy: Rebooting integration-agent-docker-1046 (Seems to be inaccessible 
since February) * 08:28 taavi: reloading zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1129765 == 2025-03-19 == * 20:32 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1129364 * 00:12 bd808: Trying the simplest thing that might work by adding a CNAME record for parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud. ([[phab:T389252|T389252]]) == 2025-03-18 == * 20:25 bd808: Rebooting deployment-jobrunner05 because things just seem weird ([[phab:T387631|T387631]], [[phab:T387276|T387276]]) * 15:18 sergi0: run CommunityUpdates config schema migration `foreachwikiindblist growthexperiments extensions/CommunityConfiguration/maintenance/migrateConfig.php CommunityUpdates` ([[phab:T387737|T387737]]) == 2025-03-14 == * 21:36 Reedy: deployed https://gerrit.wikimedia.org/r/1127982 * 16:55 Lucas_WMDE: manually killed job https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/2928/console which had been stuck since 16:33 UTC, blocking gate-and-submit :( == 2025-03-13 == * 21:29 dancy: Finished gitlab cloud runners k8s production cluster upgrade ([[phab:T388836|T388836]]) * 20:42 dancy: Finished gitlab cloud runners k8s staging cluster upgrade ([[phab:T388836|T388836]]) * 20:09 dancy: Starting gitlab cloud runners k8s production cluster upgrade ([[phab:T388836|T388836]]) * 19:26 dancy: Starting gitlab cloud runners k8s staging cluster upgrade ([[phab:T388836|T388836]]) == 2025-03-11 == * 22:54 bd808: Deleted unattached volumes: alert01, db09, deploy03, mwmaint, ores02, parsoid14-srv, prometheus05 * 22:39 bd808: Released unused floating IPs 185.15.56.9 and 185.15.56.97 back to global pool * 22:08 bd808: Updated mail.beta.wmflabs.org service name to point to 185.15.56.115 * 22:04 bd808: Deleted orphan parsoid-external-ci-access.beta.wmflabs.org. 
DNS record * 21:53 bd808: Deleted dangling prometheus-beta.wmcloud.org web proxy * 21:50 bd808: Deleted dangling w-beta.wmflabs.org web proxy * 21:42 bd808: Deleted unused "deployment-parsoid" Prefix Puppet configuration * 20:48 James_F: Docker: [quibble-bullseye-php81 & php81] Use PCRE2 backport from component/php81, for [[phab:T386006|T386006]] * 13:19 James_F: Zuul: [mediawiki/extensions/ActiveAbstract] Mark as archived, for [[phab:T382069|T382069]] * 03:54 eileen: civicrm upgraded from {{Gerrit|f2222fcd}} to {{Gerrit|ec20a105}} == 2025-03-10 == * 15:20 James_F: Zuul: [mediawiki/services/servicelib-node] Mark as archived, for [[phab:T388424|T388424]] * 13:47 hashar: gerrit: removed leftover empty directory `/srv/gerrit/plugins/lfs`. Data have been migrated to `/srv/gerrit/plugins/lfs` as part of moving Gerrit data out of `/`. See [[phab:T333143|T333143]] == 2025-03-08 == * 01:22 James_F: Zuul: [php-session-serializer] Enable PHP 8.4 as voting, for [[phab:T368270|T368270]] == 2025-03-07 == * 21:00 James_F: Zuul: [mediawiki/libs/Shellbox] Enable PHP 8.4 as voting, for [[phab:T386570|T386570]] * 20:53 James_F: Zuul: [wikipeg] Enable PHP 8.4 as voting, for [[phab:T386570|T386570]] * 20:07 James_F: Zuul: [mediawiki/libs/Equivset] Enable PHP 8.4 as voting, for [[phab:T387806|T387806]] == 2025-03-05 == * 00:21 dancy: Reenabled beta-scap-sync-world ([[phab:T166010|T166010]]) == 2025-03-04 == * 23:26 dancy: Disabling beta-scap-sync-world for noise reduction while dealing with [[phab:T166010|T166010]] * 22:10 James_F: Zuul: [mediawiki/services/example-node-api] Mark as archived, for [[phab:T387933|T387933]] * 01:42 James_F: Zuul: [mediawiki/tools/phan/SecurityCheckPlugin] Disable on PHP 8.4, for [[phab:T386570|T386570]] * 01:13 James_F: Zuul: Add WgevaertWikiBase to CI allowlist * 01:03 James_F: Zuul: Start testing in PHP 8.4 for 'mediawiki-php-library' repos, for [[phab:T386108|T386108]] == 2025-02-28 == * 18:20 dancy: Upgrading gitlab-runner to v17.7.1 in production 
gitlab-cloud-runners ([[phab:T386297|T386297]]) * 18:12 dancy: Upgrading gitlab-runner to v17.7.1 in staging gitlab-cloud-runners ([[phab:T386297|T386297]]) * 17:52 dancy: Upgraded scap to 4.138.0 in beta cluster * 16:43 bd808: Deleted now dangling parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud. DNS record ([[phab:T385849|T385849]]) * 16:40 bd808: Deleted deployment-parsoid14.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T385849|T385849]]) * 16:39 bd808: Deleted parsoid-external-ci-access.wmcloud.org proxy ([[phab:T385849|T385849]]) * 16:37 bd808: Deleted deployment-alert01.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T385849|T385849]]) * 16:36 bd808: Deleted deployment-bastion.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T385849|T385849]]) == 2025-02-27 == * 01:11 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1123063 [[phab:T386476|T386476]] == 2025-02-26 == * 20:21 James_F: jforrester@doc1003:~$ sudo -u doc-uploader rm -rf /srv/doc/cover-extensions/LdapAuthentication/ #[[phab:T376097|T376097]] * 20:18 James_F: Zuul: [mediawiki/extensions/LdapAuthentication] Mark as archived, for [[phab:T376097|T376097]] * 13:20 hashar: Updating Quibble jobs to 1.13.0. 
"Skip execution upon a success cache hit" which would make some jobs to skip tests entirely when a set of commits/image is known to have previously passed # [[phab:T383243|T383243]] {{!}} dduvall * 11:06 hashar: Tag Quibble 1.13.0 @ {{Gerrit|0ac128f7bc060c82f11317aabaf78a10b24aeeec}} # [[phab:T383243|T383243]] * 09:11 hashar: deployment-prep: cherry picking https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122901 "php: use component/pcre2 when using Php 8.1" to fix php # [[phab:T387276|T387276]] * 01:55 bd808: `./jjb-update 'integration-quibble-fullrun-*-php81' '*-php81-phan' '*php81*'` * 01:16 Reedy: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1122700 [[phab:T386006|T386006]] == 2025-02-25 == * 20:25 James_F: Docker: [php81] Update PHP to 8.1.31-1+wmf11u4, for [[phab:T386006|T386006]] * 14:07 James_F: Docker: [php81] Upgrade Wikimedia's PHP to 8.1.31-1+wmf11u3 & PCRE to 10.42 for [[phab:T386006|T386006]] == 2025-02-24 == * 01:02 jeena: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/73 == 2025-02-22 == * 11:27 taavi: rebooting integration-agent-docker-1047 which thinks it is gerrit == 2025-02-21 == * 22:54 brennen: gitlab: removing expiration date for 14 tokens expiring in 2025 ([[phab:T385930|T385930]]) * 22:36 brennen: gitlab: set require_personal_access_token_expiry and service_access_tokens_expiration_enforced to false == 2025-02-20 == * 20:15 dancy: Updated buildkitd to v0.20.0 in gitlab-cloud-runners ([[phab:T386955|T386955]]) * 20:15 dancy: Updated buildkitd to v0.20.0 in gitlab-cloud-runners == 2025-02-19 == * 21:28 dancy: Reenabled https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ ([[phab:T386851|T386851]]) * 19:35 dduvall: restarting jenkins to fix git related issues following java update ([[phab:T386755|T386755]]) * 15:47 dancy: Disabled the https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ job to reduce 
noise while the problem is being debugged. == 2025-02-18 == * 16:49 dancy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/1119815 * 16:11 James_F: Zuul: [operations/debs/dnsdist] Revert archival == 2025-02-13 == * 13:57 James_F: Zuul: [mediawiki/extensions/CirrusSearch] Drop WikibaseCirrusSearch dep, for [[phab:T386015|T386015]] == 2025-02-12 == * 17:22 James_F: Zuul: Add User:Michi j to CI allowlist * 17:21 James_F: Zuul: Add Dragoniez to CI allowlist == 2025-02-11 == * 15:43 James_F: Zuul: Make PHP 8.4 voting on lib repos where it already passes, for [[phab:T386108|T386108]] == 2025-02-10 == * 14:27 James_F: Zuul: Add Bunnypranav to CI allowlist == 2025-02-08 == * 00:07 bd808: Added `profile::maps::osm_master::disable_waterlines_import_timer: false` to deployment-maps prefix hiera ([[phab:T385921|T385921]]) == 2025-02-07 == * 22:14 brennen: phab/phorge: replaced mr-widget token in deployed config ([[phab:T385480|T385480]]) * 21:33 bd808: Added `profile::restbase::parsoid_uri: https://phabricator.wikimedia.org/T385902` to deployment-restbase prefix puppet ([[phab:T385902|T385902]]) * 01:34 bd808: Cherry-picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117997 to deployment-puppetmaster ([[phab:T385849|T385849]]) * 00:42 bd808: Shutoff deployment-parsoid14 to see if anything breaks/anyone yells ([[phab:T385849|T385849]]) == 2025-02-06 == * 23:53 bd808: Updated citoid-beta.wmflabs.org to point to deployment-docker-citoid02 * 23:50 bd808: Deleted beta-prometheus.wmflabs.org; it was pointed to an IP now owned by the mdwikioffline project. 
* 23:43 bd808: Deleted recently orphaned spiderpig.wmcloud.org proxy after discussion with dancy * 16:20 bd808: Rebooted deployment-sessionstore06 ([[phab:T385803|T385803]]) * 12:07 andrewbogott: rebooting all servers for [[phab:T385264|T385264]] == 2025-02-05 == * 19:17 James_F: Zuul: [mediawiki/extensions/DonationInterface] Switch CI from PHP74 to PHP82 * 18:23 James_F: Zuul: [mediawiki/extensions/cldr] Raise FR-special job to REL1_43 * 18:22 James_F: Zuul: [mediawiki/extensions/DonationInterface] Raise FR-special job to REL1_43 * 18:11 James_F: Zuul: [labs/tools/heritage] Fold template into this, only user * 18:08 James_F: Zuul: [mediawiki/extensions/FundraisingEmailUnsubscribe] Test in PHP 8.2+ only * 17:29 James_F: Zuul: [mediawiki/core] Test fundraising branches against PHP 8.2 * 17:19 James_F: Zuul: [mediawiki/extensions/FundraisingEmailUnsubscribe] Mark as non-prod == 2025-02-03 == * 12:34 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1115782 == 2025-01-30 == * 15:12 James_F: Zuul: [mediawiki/extensions/Wikibase] Only inject EntitySchema on 1.43+, for [[phab:T385175|T385175]] * 01:39 James_F: Zuul: [mediawiki/core] Remove composer variant from wmf branches * 00:42 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1115131 == 2025-01-29 == * 18:03 James_F: Zuul: Make FR REL1_43-php82 voting for cldr and FEU * 17:54 James_F: Zuul: Add FR REL1_43-php82 as experimental to other extensions * 17:40 James_F: Zuul: [mediawiki/extensions/cldr] Add FR REL1_43-php82 as experimental * 17:40 James_F: Zuul: [mediawiki/extensions/cldr] Re-enable FR-tech job as voting, passes fine * 16:57 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1115064 * 16:33 hashar: gerrit: marked all legacy Puppet modules as read-only ( https://gerrit.wikimedia.org/r/admin/repos/q/filter:operations/puppet/ ) and removed the associated GitHub mirrors that existed for some of them == 2025-01-28 == * 17:46 dancy: Reloading Zuul to deploy 
https://gerrit.wikimedia.org/r/c/integration/config/+/1113550 ([[phab:T383337|T383337]]) * 17:38 dancy: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/c/integration/config/+/1113549 ([[phab:T383337|T383337]]) * 10:07 hashar: Manually cleaned integration-agent-docker-1043 == 2025-01-27 == * 18:17 hashar: Cleaned disk on integration-agent-docker-1051 == 2025-01-25 == * 09:20 taavi: reloading zuul for https://gerrit.wikimedia.org/r/1113739 == 2025-01-24 == * 21:44 James_F: Revert "Zuul: Switch Fundraising jobs to REL1_43" == 2025-01-23 == * 16:31 dancy: Updating production gitlab-cloud-runners to v17.6.1 * 16:23 dancy: Updating staging gitlab-cloud-runners to v17.6.1 == 2025-01-22 == * 18:14 James_F: Zuul: [mediawiki/extensions/WikiLambda] Add Wikibase as a phan dependency == 2025-01-20 == * 09:55 hashar: Updating Quibble jobs to enable success cache experiment - [[phab:T383243|T383243]] * 08:20 hashar: Updating all Jenkins jobs to update Quibble to 1.12.0 == 2025-01-17 == * 16:59 dduvall: Building Docker images for Quibble 1.12.0 * 15:00 hashar: Building Docker images for Quibble 1.12.0 * 12:56 hashar: Tag Quibble 1.12.0 @ {{Gerrit|633099ead3ec72180e7890e1980074b4fde56c26}} # [[phab:T365978|T365978]], [[phab:T383243|T383243]] == 2025-01-14 == * 17:14 brennen: integration project: create integration-agent-docker-1059 for [[phab:T383254|T383254]] * 16:50 brennen: integration project: create integration-agent-docker-1058 for [[phab:T383254|T383254]] == 2025-01-10 == * 15:55 dancy: Updating gitlab-cloud-runners (prod) to v17.5.5 ([[phab:T383263|T383263]]) * 15:49 dancy: Updating gitlab-cloud-runners (staging) to v17.5.5 == 2025-01-09 == * 22:20 brennen: gitlab: Feature.enable(:kubernetes_agent_protected_branches) - https://docs.gitlab.com/ee/user/clusters/agent/ci_cd_workflow.html#restrict-access-to-the-agent-to-protected-branches * 18:08 James_F: Docker: [node22] Update Node to v22.13.0, & switch base image to bookworm, for 
[[phab:T383337|T383337]] * 17:01 James_F: Docker: [node20] Update Node to v20.18.1, & switch base image to bookworm, for [[phab:T383337|T383337]] * 15:13 James_F: Docker: [sury-php] Re-platform to bookworm == 2025-01-08 == * 22:07 hashar: castor: deleting potentially corrupted npm cache. On integration-castor05: sudo rm -fR /srv/castor/castor-mw-ext-and-skins/master/<nowiki>{</nowiki>wmf-quibble-selenium-php74,quibble-vendor-mysql-php74-selenium<nowiki>}</nowiki>/npm # [[phab:T383237|T383237]] == 2025-01-07 == * 22:07 hashar: Deleted /srv/zuul/git/operations/dumps/dcat on both contint1002 and contint2002 # [[phab:T157818|T157818]] * 19:00 bd808: `/usr/local/sbin/clean-stale-puppet-certs --clean` ([[phab:T383153|T383153]]) * 18:53 taavi: taavi@deployment-puppetserver-1:~$ sudo puppetserver ca clean --certname maps-master01.maps-experiments.eqiad1.wikimedia.cloud # [[phab:T383153|T383153]] * 18:50 taavi: taavi@deployment-puppetserver-1:~$ sudo puppet node clean geoshapes.maps-experiments.eqiad1.wikimedia.cloud # [[phab:T383153|T383153]] * 18:30 bd808@cloudcumin1001: END (FAIL) - Cookbook wmcs.vps.remove_instance (exit_code=1) for instance deployment-etcd04 * 18:30 bd808@cloudcumin1001: START - Cookbook wmcs.vps.remove_instance for instance deployment-etcd04 * 14:48 hashar: Manually renamed wikibase-daily-npm-audit-daily-node18-npmaudit to node20 variant and refresh the config with JJB * 14:33 James_F: Zuul: [mediawiki/extensions/WikiLambda] Only run standalone jobs in master == 2025-01-06 == * 20:16 andrewbogott: removed the (non-existent?) 
role::mw_rc_irc from puppet config for deployment-ircd03.deployment-prep.eqiad1.wikimedia.cloud * 19:35 bd808: Manually generated missing en_US.UTF-8 locale on deployment-maps-master02.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T361381|T361381]]) * 19:32 bd808: Added `postgresql::postgis::postgresql_postgis_package: postgresql-15-postgis-3` to deployment-maps Prefix Puppet to work around default parameter problem ([[phab:T361381|T361381]]) * 19:31 bd808: Issued new Puppet cert for deployment-maps-master02.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T361381|T361381]]) * 19:27 bd808: Added `postgresql::postgis::postgresql_postgis_package: ignored` to deployment-maps Prefix Puppet to work around default parameter problem ([[phab:T361381|T361381]]) * 19:15 brennen: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/71 ([[phab:T382709|T382709]]) * 19:11 bd808: Added placeholders for `graphite_host` and `statsd` to deployment-webperf Prefix Puppet * 18:53 bd808: Fixed missing profile::swift::global_account_keys::<nowiki>{</nowiki>codfw, eqiad<nowiki>}</nowiki> placeholders breaking deployment-ms-* puppet runs * 18:38 bd808: Fixed incorrect deployment-restbase prefix puppet setting that was causing puppet run failures * 18:19 bd808: Issued a new Puppet client cert for traindev01.deployment-prep.eqiad1.wikimedia.cloud * 14:58 James_F: Zuul: Drop CI for REL1_41 branch, now EOL per [[phab:T376550|T376550]] * 09:03 hashar: gerrit: flushed diff_intraline, diff_summary, gerrit_file_diff and git_file_diff caches after having turned on diff3 style # [[phab:T359821|T359821]] == 2025-01-02 == * 11:27 hashar: Reloaded Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1105679 # [[phab:T374113|T374113]] {{SAL-archives/Release Engineering}} <noinclude>[[Category:SAL]]</noinclude> omfkaio7xcvlkiok6dcuc6ic8gj19tr 2309633 2309632 2025-06-08T18:14:04Z Stashbot 7414 James_F: Zuul: 
[mediawiki/extensions/Echo] Remove EventLogging 2309633 wikitext text/x-wiki == 2025-06-08 == * 18:14 James_F: Zuul: [mediawiki/extensions/Echo] Remove EventLogging * 18:12 James_F: Zuul: Fold extension-quibble-php81-or-later template into extension-quibble * 18:04 James_F: Zuul: [mediawiki/extensions/SemanticVersion] Add basic CI == 2025-06-06 == * 14:37 jnuche: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/79 == 2025-06-05 == * 23:21 thcipriani: update scap in beta to 4.171.0 to match prod * 20:44 James_F: Zuul: [wikimedia-ui-base] Sunset WikimediaUI Base, archive repo's CI, for [[phab:T354310|T354310]] * 20:20 bd808: Added `profile::memcached::firewall_src_sets: ~` to deployment-memc prefix puppet ([[phab:T396109|T396109]]) * 19:03 Krinkle: Update profile::tlsproxy::envoy::cfssl_options under deployment-mediawiki in Horizon, to include remaining the wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary. ref [[phab:T289318|T289318]] * 18:26 James_F: Docker: Re-build PHP images with php-uuid (and incidentally bump versions), for [[phab:T373752|T373752]] * 17:14 James_F: Docker: [mediawiki-phan-testrun] Migrate parent image from php74 to php81 * 17:10 James_F: Docker: [phpmetrics] Migrate parent image from php74 to php81 * 17:10 James_F: Where will Abstract Content go? 
* 17:07 James_F: Zuul: [mediawiki/extensions/WikimediaMaintenance] Add dependencies, for [[phab:T58074|T58074]] * 16:39 James_F: Zuul: [mediawiki/tools/phan/PerfCheckPlugin] Use a template for CI * 16:37 James_F: Zuul: [mediawiki/tools/phan/SecurityCheckPlugin] Stop testing in PHP 7.4 * 16:36 James_F: Zuul: [labs/tools/heritage] Raise PHP testing from 7.4 to 8.1 * 16:34 James_F: Zuul: Stop testing most libraries and tools in PHP 7.4 * 16:28 James_F: Zuul: Stop testing PHP extensions with PHP 7.4 * 16:26 James_F: Zuul: [integration/quibble] Stop testing in PHP 7.4, for [[phab:T328921|T328921]] and [[phab:T328922|T328922]] * 16:23 James_F: Zuul: [mediawiki/services/parsoid] Stop testing in PHP 7.4 * 16:21 James_F: Zuul: [operations/mediawiki-config] Stop testing in PHP 7.4 * 16:09 James_F: Zuul: Drop all PHP 7.4 testing for MediaWiki things, for [[phab:T328921|T328921]] and [[phab:T328922|T328922]] * 04:46 Krinkle: gitpuppet@deployment-puppetserver-1:/srv/git/operations/puppet$ Cherry-pick https://gerrit.wikimedia.org/r/c/operations/puppet/+/1153764, ref [[phab:T289318|T289318]] * 03:58 Krinkle: Update profile::cache::haproxy::available_unified_certificates under deployment-cache in Horizon, to include remaining the wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary. Remove `*.zero.wikipedia.beta.wmflabs.org` which wasn't responding/didn't work anymore. 
ref [[phab:T289318|T289318]] * 03:34 Krinkle: Update profile::acme_chief::certificates under deployment-acme-chief prefix in Horizon, to include remaining the wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary (wikipedia and wikivoyage were already there), ref [[phab:T289318|T289318]] * 03:34 Krinkle: Update profile::acme_chief::certificates under deployment-acme-chief prefix in Horizon, to include remaining the wildcard and m-dot subdomains under beta.wmcloud.org for wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary (wikipedia and wikivoyage were already there) * 00:32 Krinkle: Add `TXT *.wikimedia.beta.wmcloud.org. "v=spf1 -all"` to match beta.wmflabs.org DNS (ref [[phab:T289318|T289318]], changing email is out of scope for now, but might as well add the DNS records). * 00:22 Krinkle: Adding missing DNS entries under beta.wmcloud.org. There was already: *.wikipedia, *.m.wikimedia, *.wikivoyage, *.m.wikivoyage (for [[phab:T355281|T355281]]). Adding: wikibooks, wikimedia, wikinews, wikiquote, wikisource, wikiversity, wiktionary, wikidata, upload ([[phab:T289318|T289318]]). 
== 2025-06-04 == * 21:27 James_F: Zuul: [mediawiki/extensions/Springboard] Add basic CI, for [[phab:T395981|T395981]] * 12:10 lucaswerkmeister: lucaswerkmeister@deployment-deploy04:~$ mwscript createAndPromote commonswiki --interface-admin --force 'Lucas Werkmeister' # w-beta.wmflabs.org/mt == 2025-06-03 == * 23:59 James_F: Zuul: [mediawiki/services/<some>] Upgrade test suite to Node 24 & 22, for [[phab:T395926|T395926]] * 23:56 James_F: Zuul: [wikimedia/portals] Upgrade test suite to Node 24 and Node 22, for [[phab:T395926|T395926]] * 23:56 James_F: Zuul: [wikipeg] Upgrade test suite to Node 24 and Node 22, for [[phab:T395926|T395926]] * 23:55 James_F: Zuul: [oojs/*i] Upgrade test suite to Node 24 and Node 22, for [[phab:T395926|T395926]] * 23:53 James_F: Zuul: [wikimedia/portals/deploy] Drop tests, this repo isn't testable * 23:20 James_F: Zuul: Provide experimental Node 24 jobs where Node 22 ones exist, for [[phab:T395926|T395926]] * 17:09 bd808: Forced puppet run on deployment-webperf21 to pick up Kafka config changes ([[phab:T391273|T391273]]) * 17:08 bd808: Manually expanded (duplicated) jumbo-eqiad and main-eqiad aliases in kafka_clusters hiera config ([[phab:T391273|T391273]]) * 17:04 bd808: Added jumbo-eqiad and main-eqiad aliases to kafka_clusters hiera config ([[phab:T391273|T391273]]) * 16:00 James_F: Docker: Provide initial Node 24 images, for [[phab:T395923|T395923]] * 09:53 TheresNoTime: `samtar@deployment-cache-text08:~$ sudo service varnish-frontend restart` for [[phab:T395808|T395808]] * 09:52 TheresNoTime: `samtar@deployment-cache-text08:~$ sudo -i puppet agent -tv` for [[phab:T395808|T395808]] == 2025-06-02 == * 14:37 James_F: Zuul: Add Matrix to CI allowlist * 14:37 James_F: Zuul: [operations/software/gerrit/plugins/events-wikimedia] mark as archived, for [[phab:T304947|T304947]] * 14:36 James_F: Zuul: [mediawiki/extensions/CookieConsent] Add basic CI * 13:45 hashar: Updating Jenkins jobs for "drop obsolete creation of log & src dirs" {{!}} 
https://gerrit.wikimedia.org/r/c/integration/config/+/1152702 == 2025-05-30 == * 22:16 thcipriani: killed 1000s of zuul merger jobs via https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Very_high_queue_of_merger:merge_functions for parsoid, wikibase, and core * 21:20 bd808: Poked hole in blocked_nets for 188.214.8.0/21 ([[phab:T395709|T395709]]) * 09:43 Lucas_WMDE: ssh integration-castor05.integration.eqiad1.wikimedia.cloud sudo -u jenkins-deploy rm -rf /srv/castor/castor-mw-ext-and-skins/master/mwgate-node20 # fix failure seen in mwgate-node20 57273 and 57274 == 2025-05-29 == * 22:18 bd808: Submitted WikimediaDebug v3.1.0 to addons.mozilla.org for review ([[phab:T395190|T395190]], [[phab:T315111|T315111]]) * 22:12 bd808: Submitted WikimediaDebug v3.1.0 to Chrome Web Store for review ([[phab:T395190|T395190]], [[phab:T315111|T315111]]) == 2025-05-28 == * 20:27 James_F: Zuul: [mediawiki/extensions/ArticleSummaries] Promote to Wikimedia production, for [[phab:T393940|T393940]] * 13:15 James_F: [Beta Cluster] On deployment-deploy04, running DELETE FROM localuser WHERE lu_wiki='en_rtlwiki'; and DELETE FROM localnames WHERE ln_wiki='en_rtlwiki'; as part of closing the wiki * 12:30 James_F: Zuul: Add an explanatory note to bluespice template that we skip non-LTSes == 2025-05-24 == * 21:52 Krinkle: Disable publishing notifs on Phab tasks from extension-Chart mirror, [[phab:T143162|T143162]], [[phab:T272803|T272803]] == 2025-05-23 == * 18:36 James_F: Zuul: [mediawiki/core] Restore node testing for release branches, for [[phab:T395141|T395141]] * 17:55 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1149705 == 2025-05-22 == * 21:15 bd808: Forced Puppet run and restarted varnish-frontend on deployment-cache-upload08 to pick up new config ([[phab:T393404|T393404]]) * 21:12 bd808: Forced Puppet run and restarted varnish-frontend on deployment-cache-text08 to pick up new config ([[phab:T393404|T393404]]) * 21:09 bd808: Cherry-picked 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143602 ([[phab:T393404|T393404]]) * 21:09 bd808: Added `block_help: "see https://wikitech.wikimedia.org/wiki/Beta/Blocked_help for more information."` under `profile::cache::varnish::frontend::fe_vcl_config` in both deployment-cache-text and deployment-cache-upload Prefix Puppet ([[phab:T393404|T393404]]) * 20:11 brennen: devtools: phorge: test deploying work/merge-phorge-2024.35 changes * 17:25 bd808: `./jjb-update 'selenium-daily-beta*-MediaWiki'` to deploy updates to selenium-daily-beta-MediaWiki and selenium-daily-betacommons-MediaWiki failure notifications ([[phab:T394551|T394551]]) * 14:45 dancy: Upgrade gitlab-runner to v17.10.1 in gitlab-cloud-runner (staging and production) [[phab:T394953|T394953]] * 11:39 hashar: Triggered replication of mediawiki/extensions/BlueSpiceSmartlist and mediawiki/extensions/BlueSpiceSmartList to fix https://github.com/wikimedia/mediawiki-extensions-BlueSpiceSmartlist {{!}} [[phab:T394903|T394903]] * 11:37 hashar: gerrit: changed parent of mediawiki/extensions/BlueSpiceSmartlist (lower case L) to All-Archived-Projects to prevent it from being replicated to GitHub {{!}} [[phab:T394903|T394903]] == 2025-05-21 == * 07:24 hashar: restarted Gerrit on gerrit1003 * 07:18 hashar: restarted Jenkins on contint1002 == 2025-05-20 == * 17:51 bd808: Open CDN edge blocks to allow traffic from 190.217.20.32/28 * 17:13 dancy: Restarting Jenkins on contint1002 * 16:27 James_F: Docker: [quibble-bullseye-php81-coverage]: Fix clover-edit for py39 * 14:30 James_F: Docker: [quibble-bullseye-php74-coverage] Bump phpunit-patch-coverage to 0.0.15 * 14:28 hashar: integration: cleared Docker build cache on integration-agent-docker-1052 and integration-agent-docker-1061 * 13:49 James_F: Docker: Provide quibble-bullseye-php81-coverage == 2025-05-19 == * 15:48 James_F: Zuul: Switch primary master branch testing to PHP 8.1, not 7.4 * 15:45 James_F: Zuul: Switch / remove any experimental testing to PHP 8.1, 
not 7.4 * 15:39 James_F: Zuul: Switch REL1_39 branch testing to PHP 8.1, not 7.4 * 15:37 James_F: Zuul: Switch all wmf branch testing to PHP 8.1, not 7.4 * 13:25 James_F: Zuul: Simplify the regular Quibble job name to drop 'noselenium' * 13:24 James_F: jjb: Simplify the regular Quibble job name to drop 'noselenium' * 12:18 hashar: integration: cleaned Docker build cache on integration-agent-docker-1045 * 09:26 hashar: integration: cleaned Docker build cache on integration-agent-docker-1040 == 2025-05-16 == * 16:57 James_F: Zuul: Split Quibble jobs into selenium-only and non-selenium for skins == 2025-05-15 == * 21:22 bd808: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/1146722 * 13:54 James_F: Zuul: [mediawiki/extensions/Realnames] Use vendor quibble, not composer * 09:34 codders: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/1146520 == 2025-05-14 == * 21:31 bd808: Restarted varnish-frontend on deployment-cache-text08 to pick up blocked_nets changes ([[phab:T394311|T394311]]) * 16:06 hashar: Updating jobs for "jjb: silence some shell blocks in macro-docker.yaml" {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1145090 {{!}} [[phab:T393847|T393847]] * 13:43 hashar: Reloaded Zuul for Zuul: [mediawiki/extensions/Wikibase] Enable Open Search for apitests jobs {{!}} https://gerrit.wikimedia.org/r/1145331 {{!}} [[phab:T386691|T386691]] == 2025-05-13 == * 19:27 James_F: Zuul: Upgrade all Quibble 'apitests' jobs from 7.4 to 8.1, for [[phab:T386691|T386691]], [[phab:T328921|T328921]], [[phab:T328922|T328922]] * 12:35 dcausse: deployment-prep: reindexing wikidata to pickup the "mul" language field ([[phab:T392058|T392058]]) * 08:23 hashar: Update jobs to mute checks for npm packaging files {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1145087/ {{!}} [[phab:T393847|T393847]] == 2025-05-12 == * 16:48 hashar: Updated Jenkins jobs to silence git in ci-src-setup (take 2) {{!}} 
https://gerrit.wikimedia.org/r/1144596 {{!}} [[phab:T393847|T393847]] * 16:46 bd808: Reenabled beta-scap-sync-world and beta-update-databases-eqiad Jenkins jobs * 15:55 hashar: Updated Jenkins jobs to silence git in ci-src-setup {{!}} https://gerrit.wikimedia.org/r/1144596 {{!}} [[phab:T393847|T393847]] * 15:50 bd808: `sudo /usr/local/sbin/clean-stale-puppet-certs --clean` on deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud. Attempting to fix a "Found non-revoked Puppet certificates for 1 deleted instances" Prometheus alert. * 15:28 bd808: Forced puppet run on deployment-etcd05.deployment-prep.eqiad1.wikimedia.cloud to fix Puppet run ([[phab:T393866|T393866]]) * 15:28 bd808: Forced puppet run on deployment-etcd02.deployment-prep.eqiad1.wikimedia.cloud to fix Puppet run ([[phab:T393866|T393866]]) * 15:22 bd808: Added `prometheus::instances` and `prometheus::instances_defaults` hiera settings to "deployment-etcd" Prefix Puppet via Horizon ([[phab:T393866|T393866]]) * 12:30 Krinkle: Disable publishing noise from rWSWF, [[phab:T143162|T143162]], [[phab:T267223|T267223]] * 09:52 hashar: Updating all jobs for https://gerrit.wikimedia.org/r/c/integration/config/+/1143972 "Omit noisy `ls` debugging commands when not needed" # [[phab:T282893|T282893]] & [[phab:T393847|T393847]] * 08:28 hashar: Disabled https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/ due to a failure with Etcd/expired certificate # [[phab:T393855|T393855]] * 08:15 hashar: Updated jobs for "Replace all uses of `$(pwd)` with `$PWD`" {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1143967/ * 07:58 hashar: Disabled https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ due to a failure with Etcd/expired certificate # [[phab:T393855|T393855]] == 2025-05-08 == * 20:28 dancy: Updating buildkitd to v0.21.1 in gitlab-cloud-runners * 10:58 James_F: Zuul: Support capital first letter of e-mail for Aeywoo in allow list == 2025-05-07 == * 08:52 hashar: Updating 
Jenkins jobs to Quibble 1.14.1 * 07:03 hashar: Hard rebooted integration-agent-docker-1061 via Horizon, the instance is not reachable by ssh and looks bricked # [[phab:T393542|T393542]] * 06:58 hashar: Change ssh credentials for integration-agent-docker-1060 integration-agent-docker-1061 and integration-agent-docker-1062 to `key to connect to labs instances set up with role::ci::slave::labs::common` # [[phab:T393543|T393543]] * 06:57 hashar: Added label `blubber` and `pipelinelib` to integration-agent-docker-1060 integration-agent-docker-1061 and integration-agent-docker-1062 # [[phab:T393543|T393543]] * 06:41 hashar: integration: bring back integration-agent-docker-1062 , I had it disconnected on April 30 at 6:30am UTC to clean /srv/jenkins/workspace and apparently forgot to put it back online == 2025-05-06 == * 16:16 hashar: restarting CI Jenkins due to a deadlock affecting castor-save-workspace which ends up blocking jobs # [[phab:T353925|T353925]] * 15:06 hashar: Tag Quibble 1.4.1 @ {{Gerrit|5247438621f802ba9878970b3b34b2d67cefa54c}} == 2025-05-05 == * 14:32 hashar: contint1002 and contint2002: deleted /srv/docker/buildkit following the deletion of /srv/docker/overlay2 earlier today # [[phab:T393373|T393373]] * 13:50 hashar: contint1002 and contint2002: deleted /srv/docker/image/overlay2 following the deletion of /srv/docker/overlay2 earlier today # [[phab:T393373|T393373]] * 09:45 hashar: Cleared /srv/docker/overlay2 on contint2002 * 09:41 hashar: Cleared /srv/docker/overlay2 on contint1002 (it had bunch of old layers from April/May 2024) == 2025-05-04 == * 13:10 hashar: contint1002: deleted old videos from /srv/jenkins/builds * 08:59 James_F: Zuul: [AbuseFilter] Add CommunityConfiguration as a Phan dependency, for [[phab:T393240|T393240]] * 06:33 James_F: Zuul: [mediawiki/extensions/PageImages] Add Scribunto phan dependency, for [[phab:T131911|T131911]] * 06:33 James_F: Zuul: [mediawiki/extensions/WikimediaEvents] Add CLDR dependency == 2025-05-03 == * 10:28 
James_F: Zuul: [mediawiki/extensions/PageAssessments] Add Scribunto phan dependency, for [[phab:T380122|T380122]] == 2025-05-02 == * 17:39 James_F: Zuul: [mediawiki/extensions/WikimediaMessages] Add Echo as a phan dep * 12:30 James_F: Zuul: [mediawiki/extensions/CodeEditor] Add BetaFeatures phan dependency, for [[phab:T373711|T373711]] * 12:18 James_F: Zuul: [mediawiki/extensions/WikiLambda] Make Catalyst voting again * 08:43 hashar: Updating Quibble jobs to 1.14.0 {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1140215 {{!}} [[phab:T378797|T378797]] [[phab:T384927|T384927]] [[phab:T386691|T386691]] * 07:00 James_F: Zuul: [mediawiki/extensions/WikimediaMessages] Add cldr as full CI dep too, for [[phab:T391230|T391230]] * 06:52 James_F: Zuul: [mediawiki/extensions/WikimediaMessages] Add cldr as phan dependency, for [[phab:T391230|T391230]] == 2025-04-30 == * 23:46 dancy: Re-enabled https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/ * 18:54 dancy: Disabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad while Gerrit is down. 
* 15:50 hashar: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1140203 * 15:01 hashar: Tagged Quibble 1.14.0 @ {{Gerrit|6d7c736d12daa7ea23b261ede02093f8fe7a83ae}} # [[phab:T378797|T378797]] [[phab:T384927|T384927]] [[phab:T386691|T386691]] * 06:30 hashar: integration: cleared /srv/jenkins/workspace on integration-agent-docker-1062 == 2025-04-29 == * 21:04 mutante: integration-agent-docker-1051.integration - killall -9 ffmpeg - [[phab:T392963|T392963]] * 20:28 mutante: integration-agent-docker-1048.integration - killall -9 ffpmeg - [[phab:T392963|T392963]] == 2025-04-28 == * 19:01 taavi: reloading zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1139536 * 15:49 dancy: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/76 * 13:05 James_F: Docker: Bump Node20 and Node22 binaries to latest and cascade == 2025-04-26 == * 00:05 bd808: Punched a hole in the beta cluster network blocks to allow 38.242.176.0/22 through. == 2025-04-24 == * 19:54 thcipriani: deployment-cache-text08: systemctl reload varnish-frontend following puppet run change to /etc/varnish/blocked-nets.inc.vcl * 19:49 thcipriani: deployment-cache-text08: sudo puppet-run to pick up https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/42c7880be27913c9e841642d9ff3e50deb455e08 * 15:32 bd808: Punched a hole in the beta cluster network blocks to allow 47.144.0.0/12 through. 
([[phab:T392534|T392534]]) * 14:41 dancy: Updating runners to v17.9.3 in gitlab-cloud-runners (production) * 14:34 dancy: Updating runners to v17.9.3 in gitlab-cloud-runners (staging) == 2025-04-23 == * 22:59 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 22:43 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 22:15 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up a huge pile of new blocks ([[phab:T392534|T392534]]) * 22:11 James_F: Zuul: [mediawiki/services/parsoid/testreduce] Switch Node 20 CI on, for [[phab:T382177|T382177]] * 21:47 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 21:29 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 20:47 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392534|T392534]]) * 17:43 James_F: Zuul: [mediawiki/services/parsoid/testreduce] Disable CI for now, for [[phab:T382177|T382177]] * 16:57 brennen: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/commit/a80e5211100f1cc42e4ae020d4266ea22938eb5a ([[phab:T383097|T383097]]) * 14:25 James_F: Zuul: [wikimedia/portals] Switch to Node 20, for [[phab:T382179|T382179]] == 2025-04-17 == * 10:15 hashar: gerrit: reparented apps.git to All-Archived-Projects.git in order to BLOCK `mediawiki-replication`. 
I have also archived all subprojects # [[phab:T392198|T392198]] == 2025-04-16 == * 20:59 bd808: Blocked 193.43.72.0/24 and 14.160.0.0/11 because beta was very, very sad * 16:02 James_F: Zuul: [mediawiki/extensions/WikiLambda] Make Catalyst non-voting for now * 09:20 hashar: integration: restarted integration-puppetserver-01 == 2025-04-15 == * 22:02 James_F: Zuul: [mediawiki/extensions/WikiLambda] Make Catalyst job voting, for [[phab:T368002|T368002]] * 19:40 bd808: Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks ([[phab:T392003|T392003]]) * 18:11 bd808: `bd808@deployment-cache-text08:~$ sudo service varnish-frontend restart` ([[phab:T392003|T392003]]) * 18:06 bd808: `sudo puppet agent -tv` on deployment-cache-text08 to update varnish deny list ([[phab:T392003|T392003]]) * 17:30 bd808: `shutdown -r now` on deployment-mediawiki14. Load has been growing for ~2 days. == 2025-04-11 == * 19:53 James_F: Zuul: [oojs/router] Mark as archived, for [[phab:T391709|T391709]] * 14:00 hashar: restarted integration-puppetserver: jvm went out of memory == 2025-04-10 == * 23:40 bd808: Removed wikifunctions from deployment-cache prefix puppet's profile::cache::haproxy::available_unified_certificates::server_names. https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/6af09ceaa6d261c910fb4b42d7b3e8b8172c8041%5E%21/ * 23:36 bd808: Deleted m.wikifunctions.beta.wmflabs.org, *.wikifunctions.beta.wmflabs.org, and wikifunctions.beta.wmflabs.org DNS records per [[Special:Diff/2292116]]. All 3 were pointing to 185.15.56.36. 
* 14:16 hashar: deployment-prep: `profile::mediawiki::php::increase_open_files: True` on https://horizon.wikimedia.org/project/prefixpuppet/?tab=prefix_puppet__puppet-deployment-mediawiki # [[phab:T389422|T389422]] * 14:03 James_F: [Beta Cluster] On deployment-deploy04, running DELETE FROM localuser WHERE lu_wiki='wikifunctionswiki'; and DELETE FROM localnames WHERE ln_wiki='wikifunctionswiki'; for [[phab:T391511|T391511]] == 2025-04-08 == * 22:39 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1135128 * 22:15 bd808: Manually deleted 'deployment-wikikube-v127' Magnum cluster template via Horizon. Deletion via OpenTofu has timed out repeatedly. * 22:08 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1135123 * 22:02 brennen: Updating docker-pkg files on contint primary for [[phab:T383065|T383065]] * 21:11 James_F: Beta Cluster: Shutting of deployment-docker-wikifunctions01, we decom'ing it. * 20:44 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/c/integration/config/+/1135098 == 2025-04-07 == * 17:20 bd808: `service navtiming stop` to halt "Unhandled exception in main loop, restarting consumer" crash loop ([[phab:T391272|T391272]]) * 17:15 bd808: Reboot deployment-webperf21 ([[phab:T391272|T391272]]) * 16:58 bd808: `puppet agent -tv` to catch up with missed puppet runs on deployment-webperf21 ([[phab:T391272|T391272]]) * 16:56 bd808: `rm /var/log/user.log.1` on deployment-webperf21 ([[phab:T391272|T391272]]) * 16:47 bd808: `sudo /usr/local/sbin/clean-stale-puppet-certs --clean` on deployment-puppetserver-1 to clean up dangling certs for deployment-elastic<nowiki>{</nowiki>09,10,11<nowiki>}</nowiki> == 2025-04-04 == * 09:42 Lucas_WMDE: ssh integration-castor05.integration.eqiad1.wikimedia.cloud sudo -u jenkins-deploy rm -rf /srv/castor/castor-mw-ext-and-skins/master/mwgate-node20 # fix failure seen in mwgate-node20 35782 and 35784 * 09:09 hashar: Update 
tox jobs to default to python 3.9 {{!}} https://gerrit.wikimedia.org/r/c/integration/config/+/1134168 * 08:53 hashar: Updating Quibble jobs to catch up with latest image https://gerrit.wikimedia.org/r/c/integration/config/+/1134167 {{!}} [[phab:T3666646|T3666646]] * 00:35 thcipriani: integration-agent-docker-1041 marked offline due to /srv disk space * 00:09 Krinkle: Disable duplicate publishing noise from extension-MediaUploader, ref [[phab:T143162|T143162]], [[phab:T389450|T389450]] == 2025-04-03 == * 15:06 James_F: Zuul: Configure the REL1_44 test and gate pipelines, for [[phab:T390695|T390695]] * 13:33 James_F: Docker: [quibble-bullseye] Revert MariaDB to 10.5, for (against) [[phab:T366646|T366646]] * 13:08 James_F: Zuul: [mediawiki/extensions/MetricsPlatform] Publish JS docs == 2025-04-02 == * 13:39 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133383 [[phab:T390754|T390754]] * 12:36 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133379 https://gerrit.wikimedia.org/r/1133380 * 12:20 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133373 == 2025-04-01 == * 20:46 James_F: Zuul: Swap the branch check to specific release branches, for [[phab:T390754|T390754]] etc. 
* 20:34 James_F: Docker: [quibble-bullseye] Switch MariaDB to 10.6 Wikimedia package, for [[phab:T366646|T366646]] * 20:26 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133238 * 20:09 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133231 * 19:31 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133221 [[phab:T390754|T390754]] * 18:40 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133209 [[phab:T390772|T390772]] * 16:53 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1133184 [[phab:T390754|T390754]] == 2025-03-31 == * 18:26 dancy: Reloading Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1132688 * 15:20 James_F: Zuul: [mediawiki/extensions/EmailAuth] Mark as in Wikimedia production, move up, for [[phab:T390437|T390437]] * 11:08 dcausse: [[phab:T389971|T389971]]: deleting deployment-elastic* VMs in deployment-prep * 08:24 dcausse: [[phab:T389971|T389971]]: shutting down deployment-elastic* VMs in deployment-prep == 2025-03-28 == * 22:01 Krinkle: Disable duplicate publishing noise from extension-LoginNotify, ref [[phab:T143162|T143162]], [[phab:T390315|T390315]] * 21:39 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1130957 * 21:15 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1130957 == 2025-03-27 == * 16:28 bd808: Moved Puppet configuration from deployment-cache-text08 to deployment-cache-text prefix Puppet * 16:05 bd808: `sudo systemctl restart varnish-frontend` on deployment-cache-text08 ([[phab:T390209|T390209]]) * 15:05 bd808: Moved role::acme_chief::cloud from individual instance config to deployment-acme-chief Puppet prefix. 
* 00:55 bd808: Removed prefix puppet classes for deployment-acme-chief ([[phab:T390128|T390128]]) == 2025-03-26 == * 20:23 inflatador: bking@deployment-prep populating new OpenSearch cluster indices a la https://wikitech.wikimedia.org/w/index.php?title=Search&oldid=2164435#Adding_new_wikis [[phab:T389971|T389971]] * 17:10 inflatador: bking@deployment-prep reverted an accidental replacement of deployment-acme-chief.yaml [[phab:T389971|T389971]] * 15:04 dancy: Update gitlab-runners to v17.8.4 in gitlab-cloud-runners staging and production. * 00:30 bd808: Delete parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud service name again ([[phab:T389252|T389252]]) == 2025-03-25 == * 21:11 jeena: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1130722 * 04:18 jeena: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1130729 == 2025-03-24 == * 19:35 hashar: Updating Jenkins jobs for https://gerrit.wikimedia.org/r/c/integration/config/+/1130700 == 2025-03-23 == * 18:41 James_F: Zuul: Add 0xDeadbeef to CI allowlist * 18:34 James_F: Zuul: [operations/debs/bdsync] Mark as archived, for [[phab:T377882|T377882]] * 18:31 James_F: Zuul: [mediawiki/extensions/CheckUser] Add GrowthExperiments dependency, for [[phab:T386435|T386435]] * 18:29 James_F: Zuul: [mediawiki/extensions/CategoryWatch] Add Echo CI dependency == 2025-03-20 == * 23:31 bd808: integration: thcipriani added integration-agent-docker-106<nowiki>{</nowiki>0,1,2<nowiki>}</nowiki> earlier today ([[phab:T389554|T389554]]) * 22:50 brennen: integration: added jenkins nodes for integration-agent-docker-106<nowiki>{</nowiki>3,4,5<nowiki>}</nowiki> with 3 executors per each ([[phab:T389554|T389554]]) * 21:41 brennen: integration: launched integration-agent-docker-106<nowiki>{</nowiki>3,4,5<nowiki>}</nowiki> ([[phab:T389554|T389554]]) * 21:25 eileen: civicrm upgraded from {{Gerrit|7b532ad7}} to {{Gerrit|fba4c3d6}} * 15:13 dancy: Rebooting integration-agent-docker-1046 (Seems to be inaccessible 
since February) * 08:28 taavi: reloading zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1129765 == 2025-03-19 == * 20:32 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1129364 * 00:12 bd808: Trying the simplest thing that might work by adding a CNAME record for parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud. ([[phab:T389252|T389252]]) == 2025-03-18 == * 20:25 bd808: Rebooting deployment-jobrunner05 because things just seem weird ([[phab:T387631|T387631]], [[phab:T387276|T387276]]) * 15:18 sergi0: run CommunityUpdates config schema migration `foreachwikiindblist growthexperiments extensions/CommunityConfiguration/maintenance/migrateConfig.php CommunityUpdates` ([[phab:T387737|T387737]]) == 2025-03-14 == * 21:36 Reedy: deployed https://gerrit.wikimedia.org/r/1127982 * 16:55 Lucas_WMDE: manually killed job https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php81/2928/console which had been stuck since 16:33 UTC, blocking gate-and-submit :( == 2025-03-13 == * 21:29 dancy: Finished gitlab cloud runners k8s production cluster upgrade ([[phab:T388836|T388836]]) * 20:42 dancy: Finished gitlab cloud runners k8s staging cluster upgrade ([[phab:T388836|T388836]]) * 20:09 dancy: Starting gitlab cloud runners k8s production cluster upgrade ([[phab:T388836|T388836]]) * 19:26 dancy: Starting gitlab cloud runners k8s staging cluster upgrade ([[phab:T388836|T388836]]) == 2025-03-11 == * 22:54 bd808: Deleted unattached volumes: alert01, db09, deploy03, mwmaint, ores02, parsoid14-srv, prometheus05 * 22:39 bd808: Released unused floating IPs 185.15.56.9 and 185.15.56.97 back to global pool * 22:08 bd808: Updated mail.beta.wmflabs.org service name to point to 185.15.56.115 * 22:04 bd808: Deleted orphan parsoid-external-ci-access.beta.wmflabs.org. 
DNS record * 21:53 bd808: Deleted dangling prometheus-beta.wmcloud.org web proxy * 21:50 bd808: Deleted dangling w-beta.wmflabs.org web proxy * 21:42 bd808: Deleted unused "deployment-parsoid" Prefix Puppet configuration * 20:48 James_F: Docker: [quibble-bullseye-php81 & php81] Use PCRE2 backport from component/php81, for [[phab:T386006|T386006]] * 13:19 James_F: Zuul: [mediawiki/extensions/ActiveAbstract] Mark as archived, for [[phab:T382069|T382069]] * 03:54 eileen: civicrm upgraded from {{Gerrit|f2222fcd}} to {{Gerrit|ec20a105}} == 2025-03-10 == * 15:20 James_F: Zuul: [mediawiki/services/servicelib-node] Mark as archived, for [[phab:T388424|T388424]] * 13:47 hashar: gerrit: removed leftover empty directory `/srv/gerrit/plugins/lfs`. Data have been migrated to `/srv/gerrit/plugins/lfs` as part of moving Gerrit data out of `/`. See [[phab:T333143|T333143]] == 2025-03-08 == * 01:22 James_F: Zuul: [php-session-serializer] Enable PHP 8.4 as voting, for [[phab:T368270|T368270]] == 2025-03-07 == * 21:00 James_F: Zuul: [mediawiki/libs/Shellbox] Enable PHP 8.4 as voting, for [[phab:T386570|T386570]] * 20:53 James_F: Zuul: [wikipeg] Enable PHP 8.4 as voting, for [[phab:T386570|T386570]] * 20:07 James_F: Zuul: [mediawiki/libs/Equivset] Enable PHP 8.4 as voting, for [[phab:T387806|T387806]] == 2025-03-05 == * 00:21 dancy: Reenabled beta-scap-sync-world ([[phab:T166010|T166010]]) == 2025-03-04 == * 23:26 dancy: Disabling beta-scap-sync-world for noise reduction while dealing with [[phab:T166010|T166010]] * 22:10 James_F: Zuul: [mediawiki/services/example-node-api] Mark as archived, for [[phab:T387933|T387933]] * 01:42 James_F: Zuul: [mediawiki/tools/phan/SecurityCheckPlugin] Disable on PHP 8.4, for [[phab:T386570|T386570]] * 01:13 James_F: Zuul: Add WgevaertWikiBase to CI allowlist * 01:03 James_F: Zuul: Start testing in PHP 8.4 for 'mediawiki-php-library' repos, for [[phab:T386108|T386108]] == 2025-02-28 == * 18:20 dancy: Upgrading gitlab-runner to v17.7.1 in production 
gitlab-cloud-runners ([[phab:T386297|T386297]]) * 18:12 dancy: Upgrading gitlab-runner to v17.7.1 in staging gitlab-cloud-runners ([[phab:T386297|T386297]]) * 17:52 dancy: Upgraded scap to 4.138.0 in beta cluster * 16:43 bd808: Deleted now dangling parsoid.svc.deployment-prep.eqiad1.wikimedia.cloud. DNS record ([[phab:T385849|T385849]]) * 16:40 bd808: Deleted deployment-parsoid14.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T385849|T385849]]) * 16:39 bd808: Deleted parsoid-external-ci-access.wmcloud.org proxy ([[phab:T385849|T385849]]) * 16:37 bd808: Deleted deployment-alert01.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T385849|T385849]]) * 16:36 bd808: Deleted deployment-bastion.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T385849|T385849]]) == 2025-02-27 == * 01:11 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1123063 [[phab:T386476|T386476]] == 2025-02-26 == * 20:21 James_F: jforrester@doc1003:~$ sudo -u doc-uploader rm -rf /srv/doc/cover-extensions/LdapAuthentication/ #[[phab:T376097|T376097]] * 20:18 James_F: Zuul: [mediawiki/extensions/LdapAuthentication] Mark as archived, for [[phab:T376097|T376097]] * 13:20 hashar: Updating Quibble jobs to 1.13.0. 
"Skip execution upon a success cache hit" which would make some jobs to skip tests entirely when a set of commits/image is known to have previously passed # [[phab:T383243|T383243]] {{!}} dduvall * 11:06 hashar: Tag Quibble 1.13.0 @ {{Gerrit|0ac128f7bc060c82f11317aabaf78a10b24aeeec}} # [[phab:T383243|T383243]] * 09:11 hashar: deployment-prep: cherry picking https://gerrit.wikimedia.org/r/c/operations/puppet/+/1122901 "php: use component/pcre2 when using Php 8.1" to fix php # [[phab:T387276|T387276]] * 01:55 bd808: `./jjb-update 'integration-quibble-fullrun-*-php81' '*-php81-phan' '*php81*'` * 01:16 Reedy: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/1122700 [[phab:T386006|T386006]] == 2025-02-25 == * 20:25 James_F: Docker: [php81] Update PHP to 8.1.31-1+wmf11u4, for [[phab:T386006|T386006]] * 14:07 James_F: Docker: [php81] Upgrade Wikimedia's PHP to 8.1.31-1+wmf11u3 & PCRE to 10.42 for [[phab:T386006|T386006]] == 2025-02-24 == * 01:02 jeena: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/73 == 2025-02-22 == * 11:27 taavi: rebooting integration-agent-docker-1047 which thinks it is gerrit == 2025-02-21 == * 22:54 brennen: gitlab: removing expiration date for 14 tokens expiring in 2025 ([[phab:T385930|T385930]]) * 22:36 brennen: gitlab: set require_personal_access_token_expiry and service_access_tokens_expiration_enforced to false == 2025-02-20 == * 20:15 dancy: Updated buildkitd to v0.20.0 in gitlab-cloud-runners ([[phab:T386955|T386955]]) * 20:15 dancy: Updated buildkitd to v0.20.0 in gitlab-cloud-runners == 2025-02-19 == * 21:28 dancy: Reenabled https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-sync-world/ ([[phab:T386851|T386851]]) * 19:35 dduvall: restarting jenkins to fix git related issues following java update ([[phab:T386755|T386755]]) * 15:47 dancy: Disabled the https://integration.wikimedia.org/ci/job/beta-scap-sync-world/ job to reduce 
noise while the problem is being debugged. == 2025-02-18 == * 16:49 dancy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/c/integration/config/+/1119815 * 16:11 James_F: Zuul: [operations/debs/dnsdist] Revert archival == 2025-02-13 == * 13:57 James_F: Zuul: [mediawiki/extensions/CirrusSearch] Drop WikibaseCirrusSearch dep, for [[phab:T386015|T386015]] == 2025-02-12 == * 17:22 James_F: Zuul: Add User:Michi j to CI allowlist * 17:21 James_F: Zuul: Add Dragoniez to CI allowlist == 2025-02-11 == * 15:43 James_F: Zuul: Make PHP 8.4 voting on lib repos where it already passes, for [[phab:T386108|T386108]] == 2025-02-10 == * 14:27 James_F: Zuul: Add Bunnypranav to CI allowlist == 2025-02-08 == * 00:07 bd808: Added `profile::maps::osm_master::disable_waterlines_import_timer: false` to deployment-maps prefix hiera ([[phab:T385921|T385921]]) == 2025-02-07 == * 22:14 brennen: phab/phorge: replaced mr-widget token in deployed config ([[phab:T385480|T385480]]) * 21:33 bd808: Added `profile::restbase::parsoid_uri: https://phabricator.wikimedia.org/T385902` to deployment-restbase prefix puppet ([[phab:T385902|T385902]]) * 01:34 bd808: Cherry-picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/1117997 to deployment-puppetmaster ([[phab:T385849|T385849]]) * 00:42 bd808: Shutoff deployment-parsoid14 to see if anything breaks/anyone yells ([[phab:T385849|T385849]]) == 2025-02-06 == * 23:53 bd808: Updated citoid-beta.wmflabs.org to point to deployment-docker-citoid02 * 23:50 bd808: Deleted beta-prometheus.wmflabs.org; it was pointed to an IP now owned by the mdwikioffline project. 
* 23:43 bd808: Deleted recently orphaned spiderpig.wmcloud.org proxy after discussion with dancy * 16:20 bd808: Rebooted deployment-sessionstore06 ([[phab:T385803|T385803]]) * 12:07 andrewbogott: rebooting all servers for [[phab:T385264|T385264]] == 2025-02-05 == * 19:17 James_F: Zuul: [mediawiki/extensions/DonationInterface] Switch CI from PHP74 to PHP82 * 18:23 James_F: Zuul: [mediawiki/extensions/cldr] Raise FR-special job to REL1_43 * 18:22 James_F: Zuul: [mediawiki/extensions/DonationInterface] Raise FR-special job to REL1_43 * 18:11 James_F: Zuul: [labs/tools/heritage] Fold template into this, only user * 18:08 James_F: Zuul: [mediawiki/extensions/FundraisingEmailUnsubscribe] Test in PHP 8.2+ only * 17:29 James_F: Zuul: [mediawiki/core] Test fundraising branches against PHP 8.2 * 17:19 James_F: Zuul: [mediawiki/extensions/FundraisingEmailUnsubscribe] Mark as non-prod == 2025-02-03 == * 12:34 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1115782 == 2025-01-30 == * 15:12 James_F: Zuul: [mediawiki/extensions/Wikibase] Only inject EntitySchema on 1.43+, for [[phab:T385175|T385175]] * 01:39 James_F: Zuul: [mediawiki/core] Remove composer variant from wmf branches * 00:42 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1115131 == 2025-01-29 == * 18:03 James_F: Zuul: Make FR REL1_43-php82 voting for cldr and FEU * 17:54 James_F: Zuul: Add FR REL1_43-php82 as experimental to other extensions * 17:40 James_F: Zuul: [mediawiki/extensions/cldr] Add FR REL1_43-php82 as experimental * 17:40 James_F: Zuul: [mediawiki/extensions/cldr] Re-enable FR-tech job as voting, passes fine * 16:57 Reedy: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/1115064 * 16:33 hashar: gerrit: marked all legacy Puppet modules as read-only ( https://gerrit.wikimedia.org/r/admin/repos/q/filter:operations/puppet/ ) and removed the associated GitHub mirrors that existed for some of them == 2025-01-28 == * 17:46 dancy: Reloading Zuul to deploy 
https://gerrit.wikimedia.org/r/c/integration/config/+/1113550 ([[phab:T383337|T383337]]) * 17:38 dancy: Updating docker-pkg files on contint primary for https://gerrit.wikimedia.org/r/c/integration/config/+/1113549 ([[phab:T383337|T383337]]) * 10:07 hashar: Manually cleaned integration-agent-docker-1043 == 2025-01-27 == * 18:17 hashar: Cleaned disk on integration-agent-docker-1051 == 2025-01-25 == * 09:20 taavi: reloading zuul for https://gerrit.wikimedia.org/r/1113739 == 2025-01-24 == * 21:44 James_F: Revert "Zuul: Switch Fundraising jobs to REL1_43" == 2025-01-23 == * 16:31 dancy: Updating production gitlab-cloud-runners to v17.6.1 * 16:23 dancy: Updating staging gitlab-cloud-runners to v17.6.1 == 2025-01-22 == * 18:14 James_F: Zuul: [mediawiki/extensions/WikiLambda] Add Wikibase as a phan dependency == 2025-01-20 == * 09:55 hashar: Updating Quibble jobs to enable success cache experiment - [[phab:T383243|T383243]] * 08:20 hashar: Updating all Jenkins jobs to update Quibble to 1.12.0 == 2025-01-17 == * 16:59 dduvall: Building Docker images for Quibble 1.12.0 * 15:00 hashar: Building Docker images for Quibble 1.12.0 * 12:56 hashar: Tag Quibble 1.12.0 @ {{Gerrit|633099ead3ec72180e7890e1980074b4fde56c26}} # [[phab:T365978|T365978]], [[phab:T383243|T383243]] == 2025-01-14 == * 17:14 brennen: integration project: create integration-agent-docker-1059 for [[phab:T383254|T383254]] * 16:50 brennen: integration project: create integration-agent-docker-1058 for [[phab:T383254|T383254]] == 2025-01-10 == * 15:55 dancy: Updating gitlab-cloud-runners (prod) to v17.5.5 ([[phab:T383263|T383263]]) * 15:49 dancy: Updating gitlab-cloud-runners (staging) to v17.5.5 == 2025-01-09 == * 22:20 brennen: gitlab: Feature.enable(:kubernetes_agent_protected_branches) - https://docs.gitlab.com/ee/user/clusters/agent/ci_cd_workflow.html#restrict-access-to-the-agent-to-protected-branches * 18:08 James_F: Docker: [node22] Update Node to v22.13.0, & switch base image to bookworm, for 
[[phab:T383337|T383337]] * 17:01 James_F: Docker: [node20] Update Node to v20.18.1, & switch base image to bookworm, for [[phab:T383337|T383337]] * 15:13 James_F: Docker: [sury-php] Re-platform to bookworm == 2025-01-08 == * 22:07 hashar: castor: deleting potentially corrupted npm cache. On integration-castor05: sudo rm -fR /srv/castor/castor-mw-ext-and-skins/master/<nowiki>{</nowiki>wmf-quibble-selenium-php74,quibble-vendor-mysql-php74-selenium<nowiki>}</nowiki>/npm # [[phab:T383237|T383237]] == 2025-01-07 == * 22:07 hashar: Deleted /srv/zuul/git/operations/dumps/dcat on both contint1002 and contint2002 # [[phab:T157818|T157818]] * 19:00 bd808: `/usr/local/sbin/clean-stale-puppet-certs --clean` ([[phab:T383153|T383153]]) * 18:53 taavi: taavi@deployment-puppetserver-1:~$ sudo puppetserver ca clean --certname maps-master01.maps-experiments.eqiad1.wikimedia.cloud # [[phab:T383153|T383153]] * 18:50 taavi: taavi@deployment-puppetserver-1:~$ sudo puppet node clean geoshapes.maps-experiments.eqiad1.wikimedia.cloud # [[phab:T383153|T383153]] * 18:30 bd808@cloudcumin1001: END (FAIL) - Cookbook wmcs.vps.remove_instance (exit_code=1) for instance deployment-etcd04 * 18:30 bd808@cloudcumin1001: START - Cookbook wmcs.vps.remove_instance for instance deployment-etcd04 * 14:48 hashar: Manually renamed wikibase-daily-npm-audit-daily-node18-npmaudit to node20 variant and refresh the config with JJB * 14:33 James_F: Zuul: [mediawiki/extensions/WikiLambda] Only run standalone jobs in master == 2025-01-06 == * 20:16 andrewbogott: removed the (non-existent?) 
role::mw_rc_irc from puppet config for deployment-ircd03.deployment-prep.eqiad1.wikimedia.cloud * 19:35 bd808: Manually generated missing en_US.UTF-8 locale on deployment-maps-master02.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T361381|T361381]]) * 19:32 bd808: Added `postgresql::postgis::postgresql_postgis_package: postgresql-15-postgis-3` to deployment-maps Prefix Puppet to work around default parameter problem ([[phab:T361381|T361381]]) * 19:31 bd808: Issued new Puppet cert for deployment-maps-master02.deployment-prep.eqiad1.wikimedia.cloud ([[phab:T361381|T361381]]) * 19:27 bd808: Added `postgresql::postgis::postgresql_postgis_package: ignored` to deployment-maps Prefix Puppet to work around default parameter problem ([[phab:T361381|T361381]]) * 19:15 brennen: Updating development images on contint primary for https://gitlab.wikimedia.org/repos/releng/dev-images/-/merge_requests/71 ([[phab:T382709|T382709]]) * 19:11 bd808: Added placeholders for `graphite_host` and `statsd` to deployment-webperf Prefix Puppet * 18:53 bd808: Fixed missing profile::swift::global_account_keys::<nowiki>{</nowiki>codfw, eqiad<nowiki>}</nowiki> placeholders breaking deployment-ms-* puppet runs * 18:38 bd808: Fixed incorrect deployment-restbase prefix puppet setting that was causing puppet run failures * 18:19 bd808: Issued a new Puppet client cert for traindev01.deployment-prep.eqiad1.wikimedia.cloud * 14:58 James_F: Zuul: Drop CI for REL1_41 branch, now EOL per [[phab:T376550|T376550]] * 09:03 hashar: gerrit: flushed diff_intraline, diff_summary, gerrit_file_diff and git_file_diff caches after having turned on diff3 style # [[phab:T359821|T359821]] == 2025-01-02 == * 11:27 hashar: Reloaded Zuul for https://gerrit.wikimedia.org/r/c/integration/config/+/1105679 # [[phab:T374113|T374113]] {{SAL-archives/Release Engineering}} <noinclude>[[Category:SAL]]</noinclude> frlfp62lcm570tb8v3rd9w18uil9qaq Portal:Toolforge/Admin 0 21478 2309688 2300256 2025-06-09T09:55:01Z FNegri-WMF 32595 /* 
Granting a tool write access to Elasticsearch */ fix steps for editing hiera 2309688 wikitext text/x-wiki {{Toolforge nav_admin|nocat=1}} Documentation of backend components and '''admin procedures for Toolforge'''. See [[Help:Toolforge]] for user facing documentation about actually using Toolforge to run your bots and webservices. == Admin permissions == Performing admin procedures requires having admin permissions on Toolforge. There is not a single "admin" flag, but a set of interrelated permissions you can be granted. These are described in detail in the page [[Portal:Toolforge/Admin/Toolforge_roots_and_Toolforge_admins|Toolforge roots and Toolforge admins]]. == Failover == Tools should be able to survive the failure of any one virt* node. Some items may need manual failover === WebProxy === {{tracked|T283948}} The front web proxy is now a stateless web service. There are two <code>tools-proxy-N</code> VMs in the <code>tools</code> project, which previously ran [[Obsolete:Portal:Toolforge/Admin/Dynamicproxy|Dynamicproxy]] and nowadays just proxy everything to the [[Portal:Toolforge/Admin/Kubernetes/New cluster#front_proxy_%28haproxy%29|K8s HAProxies]]. The only meaningful thing that currently happens on them is the toolviews counting based on the access logs. Otherwise we could remove those nodes and just point to HAProxy. In case one VM is not working correctly, we can failover from one VM to the other, which can be done by manually reassigning the floating IP in Horizon or from the OpenStack CLI. {{Note|This is a different proxy from the [[Portal:Cloud VPS/Admin/Web proxy|Cloud VPS Web Proxy]].}} === Static webserver === This is a stateless simple nginx http server. Simply switch the floating IP from tools-static-10 to tools-static-11 (or vice versa) to switch over. Recovery is also equally trivial - just bring the machine back up and make sure puppet is ok. === Checker service === This is the service that Icinga hits to check status of several services. 
It's totally stateless. See [[Portal:Toolforge/Admin/Toolschecker]] === Redis === Redis uses [[Portal:Toolforge/Admin/Redis|Sentinel]] to automatically fail over in case of a node failure. === Prometheus === See [[Portal:Toolforge/Admin/Prometheus#Failover]]. === Services === Service nodes run the Toolforge internal '''aptly''' service, to serve .deb packages as a repository for all the other nodes. == Command orchestration == Toolforge and Toolsbeta both have a local [[cumin]] server. == Administrative tasks == === Logging in as root === For normal login root access see [[Portal:Toolforge/Admin/Toolforge_roots_and_Toolforge_admins|Toolforge roots and Toolforge admins]]. In case the normal login does not work for example due to an LDAP failure, administrators can also directly log in as root. To prepare for that occasion, generate a separate key with <code>ssh-keygen</code>, add an entry to the <code>passwords::root::extra_keys</code> hash in Horizon's 'Project Puppet' section with your shell username as key and your public key as value and wait a Puppet cycle to have your key added to the <code>root</code> accounts. Add to your <code>~/.ssh/config</code>: <pre> # Use different identity for Tools root. Match host *.tools.eqiad1.wikimedia.cloud user root IdentityFile ~/.ssh/your_secret_root_key </pre> The code that reads <code>passwords::root::extra_keys</code> is in [https://phabricator.wikimedia.org/diffusion/LPRI/browse/master/modules/passwords/manifests/init.pp labs/private:modules/passwords/manifests/init.pp]. === Disabling all ssh logins except root === Useful for dealing with security critical situations. Just touch <code>/etc/nologin</code> and PAM will prevent any and all non-root logins. === Complaints of bastion being slow === {{Tracked|T266300}} Users are increasingly noticing slowness on tools-login due to either CPU or IOPS exhaustion caused by people running processes there instead of on Kubernetes. 
Here are some tips for finding the processes in need of killing: * Look for IOPS hogs ** <syntaxhighlight inline lang="shell-session">$ iotop</syntaxhighlight> * Look for abnormal processes: ** <syntaxhighlight inline lang="shell-session">$ ps axo user:32,pid,cmd | grep -Ev "^($USER|root|daemon|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data)" | grep -ivE 'screen|tmux|-bash|mosh-server|sshd:|/bin/bash|/bin/zsh'</syntaxhighlight> ** If you see <code>pyb.py</code> kill with extreme prejudice. * If the rogue job is running as a tool, <code>!log</code> something like: <code><nowiki>!log tools.$TOOL Killed $PROC process running on tools-bastion-NN. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework for instructions on running jobs on Kubernetes.</nowiki></code> === Local package management === Local packages are provided by an <code>aptly</code> repository on <code>tools-services-05</code>. On <code>tools-services-05</code>, you can manipulate the package database by various commands; cf. <code>aptly(1)</code>. Afterwards, you need to publish the database to the file <code>Packages</code> by (for the <code>trusty-tools</code> repository) <code>aptly publish --skip-signing update trusty-tools</code>. To use the packages on the clients you need to wait 30 minutes again or run <code>apt-get update</code>. In general, you should never just delete packages, but move them to <code>~tools.admin/archived-packages</code>. You can always see where a package is (would be) coming from with <code>apt-cache showpkg $package</code>. ==== Local package policy ==== '''Package repositories''' * We only install packages from trustworthy repositories. 
** OK are: *** The official Debian and Ubuntu repositories, and *** Self-built packages (apt.wikimedia.org and aptly) ** Not OK are: *** PPAs *** other 3rd party repositories ''Packagers effectively get root on our systems, as they could add a rootkit to the package, or upload an unsafe sshd version, and apt-get will happily install it'' Hardship clause: in extraordinary cases, and for 'grandfathered in' packages, we can deviate from this policy, as long as security and maintainability are kept in mind. '''apt.wikimedia.org''' We assume that whatever is good for production is also OK for Toolforge. '''aptly''' We manage the aptly repository ourselves. * Packages in aptly need to be built by Toolforge admins ** we cannot import .deb files from untrusted 3rd party sources * Package source files need to come from a trusted source ** a source file from a trusted source (i.e. backports), or ** we build the Debian source files ourselves ** we cannot build .dsc files from untrusted 3rd party sources * Packages need to be easy to update and build ** cowbuilder/pdebuild OK ** fpm is OK ** See [[Nova Resource:Tools/Admin/Deploy new jobutils package|Deploy new jobutils package]] for an example walk through of building and adding packages to aptly. * We only package if strictly necessary ** infrastructure packages ** packages that should be available for effective development (e.g. composer or sbt) ** not: python-*, lib*-perl, ..., which should just be installed with the available platform-specific package managers * For each package, it should be clear who is responsible for keeping it up to date ** for infrastructure packages, this should be one of the paid staffers A list of locally maintained packages can be found under [[Nova Resource:Tools/Admin/local packages|/local packages]]. 
==== Building packages ==== {{note| moved to [[Portal:Toolforge/Admin/Packaging]]}} ==== Deploy new misctools package ==== {{note | moved to [[Portal:Toolforge/Admin/Packaging]] }} ==== Testing/QA for a new tools-webservice package ==== See also [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/tools-webservice/+/refs/heads/master/README tools-webservice source tree README]. There is a simple flask app in toolsbeta using the tool <code>test</code> that is set up to be deployed via webservice on Kubernetes. After running <code>become test</code>, you can go to the <code>qa/tools-webservice</code> directory. This is checked out via anonymous https, and is suitable for checking out a patch you are reviewing. There is usually an untracked file in there that is useful here. The <code>webservice</code> file at the root is just a copy of the one in the <code>scripts</code> folder in the repo. The only difference is: <syntaxhighlight lang=diff> 9d8 < sys.path.insert(0, '') </syntaxhighlight> That line puts the local directory ahead of the distribution-installed package in the Python path, so if you run <code>./webservice $somecommand</code> it will run what is in your local folder rather than what is in <code>/usr/lib/python3/dist-packages/</code>. If you are testing changes made directly to <code>scripts/webservice</code> in the repo, you will likely need to copy that over the file and add <code>sys.path.insert(0, "")</code> after the <code>import sys</code> line. If there is no <code>import sys</code> line in this version of the code, add one! This should let you bang on your new version without having to mess with packaging yet. ==== Deploy new tools-webservice package ==== {{note | moved to [[Portal:Toolforge/Admin/Packaging]] }} === Webserver statistics === To get a look at webserver statistics, [//goaccess.io goaccess] is installed on the webproxies. 
Usage: <pre>goaccess --date-format="%d/%b/%Y" --log-format='%h - - [%d:%t %^] "%r" %s %b "%R" "%u"' -q -f/var/log/nginx/access.log</pre> Interactive key bindings are documented on [http://goaccess.io/man#interactive-keys the man page]. HTML output is supported by piping to a file. Note that nginx logs are rotated (twice?) daily, so there is only very recent data available. === Banning an IP from tool labs === On [[Hiera:Tools]], add the IP to the list of dynamicproxy::banned_ips, then force a puppet run on the webproxies. Add a note to [[Help:Toolforge/Banned]] explaining why. The user will get a message like [https://toolforge.org/.error/banned.html]. === Deploying the main web page === This website (plus the 403/500/503 error pages) is hosted under <code>tools.admin</code>. To deploy:<syntaxhighlight lang="shell-session"> $ become admin $ cd tool-admin-web $ git pull </syntaxhighlight> === Regenerate replica.my.cnf === {{See also|Portal:Data Services/Admin/Wiki Replicas#Account management (maintain-dbusers)}} This requires access to the cloudcontrol host [https://codesearch.wmcloud.org/search/?q=maintain_dbusers_primary&files=&excludeFiles=&repos= which is running maintain-dbusers], and can be done as follows: <syntaxhighlight lang="shell-session"> $ ssh cloudcontrolXXXX.eqiad.wmnet $ sudo /usr/local/sbin/maintain-dbusers delete tools.${NAME} --account-type=tool :# or $ sudo /usr/local/sbin/maintain-dbusers delete ${USERNAME} --account-type=user </syntaxhighlight> Once the account has been deleted, the maintain-dbusers service will automatically recreate the user account. ==== Debugging bad MariaDB credentials ==== {{anchor|Debugging bad mysql credentials}} Sometimes things go wrong and a user's <code>replica.my.cnf</code> credentials don't propagate everywhere. You can check the status on various servers to try to narrow down what went wrong. 
The database credentials needed are in <code>/etc/dbusers.yaml</code> on the cloudcontrol host running maintain-dbusers. <syntaxhighlight lang="shell-session"> $ ssh cloudcontrolXXXX.eqiad.wmnet $ sudo cat /etc/dbusers.yaml :# look for the accounts-backend['password'] for the m5-master connections (user: labsdbaccounts) :# look for the labsdbs['password'] for the other connections (user: labsdbadmin) $ CHECK_UID=u12345 # User id to check for :# Check if the user is in our meta datastore $ mariadb -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM account WHERE mysql_username='${CHECK_UID}'\G" :# Check if all the accounts are created in the labsdb boxes from meta datastore. $ ACCT_ID=.... # Account_id is foreign key (id from account table) $ mariadb -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM labsdbaccounts.account_host WHERE account_id=${ACCT_ID}\G" :# Check the actual labsdbs if needed (double quotes outside so that the shell expands ${CHECK_UID}) $ mariadb -h clouddbXXXX.eqiad.wmnet -u labsdbadmin -p -e "SELECT User, Password FROM mysql.user WHERE User LIKE '${CHECK_UID}';" :# Resynchronize account state on the replicas by finding missing GRANTS on each db server $ sudo maintain-dbusers harvest-replicas </syntaxhighlight> See [[phab:T183644]] for an example of fixing a failure of automatic credential creation caused when an old LDAP user becomes a Toolforge member and has an untracked user account on toolsdb. === Regenerate kubernetes credentials for tools (.kube/config) === With admin credentials (root on a control plane node will do), run <code>kubectl -n tool-<toolname> delete cm maintain-kubeusers-<toolname></code>; it should get regenerated within minutes. 
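The command above repeats the tool name twice, which is easy to mistype. A minimal sketch of a helper that builds it from a single argument is shown here; the function name <code>kubeusers_reset_cmd</code> is hypothetical (it is not an existing Toolforge script), and it only prints the command for review instead of running it:

```shell
# Hypothetical helper, not an existing Toolforge script: print the kubectl
# command that deletes the maintain-kubeusers ConfigMap for one tool.
# The namespace and ConfigMap naming follow the command quoted above.
kubeusers_reset_cmd() {
  tool=$1
  printf 'kubectl -n tool-%s delete cm maintain-kubeusers-%s\n' "$tool" "$tool"
}

# Print the command for a tool called "example"; copy it to a control plane
# node and run it there with admin credentials.
kubeusers_reset_cmd example
```

Printing rather than executing keeps the sketch safe to try anywhere; whether to pipe the output into a shell on a control plane node is left to the admin.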
=== Adding K8S Components === See [[Portal:Toolforge/Admin/Kubernetes#Building_new_nodes]] === Deleting a tool === {{Tracked|T170355|Resolved}} For batch or CLI deletion of tools, use the 'mark_tool' command on a cloudcontrol node: [[File:Tool disable process.png|thumb|The awful truth about tool deletion]] <syntaxhighlight lang="shell-session"> andrew@cloudcontrol1003:~$ sudo mark_tool usage: mark_tool [-h] [--ldap-user LDAP_USER] [--ldap-password LDAP_PASSWORD] [--ldap-base-dn LDAP_BASE_DN] [--project PROJECT] [--disable] [--delete] [--enable] tool mark_tool: error: the following arguments are required: tool </syntaxhighlight> Maintainers can mark their tools for deletion using the "Disable tool" button on the tool's detail page on https://toolsadmin.wikimedia.org/. In either case, the immediate effect of disabling a tool is to stop any running jobs, prevent users from logging in as that tool, and schedule archiving and deletion for 40 days in the future. [[File:Tool restore process.png|thumb|A tool can be restored within 40 days of being disabled]] Tool archives are stored on the tools NFS server, currently <code>tools-nfs-2.tools.eqiad1.wikimedia.cloud</code>: <syntaxhighlight lang="shell-session"> root@labstore1004:/srv/disable-tool# ls -ltrah /srv/tools/archivedtools/ total 1.8G drwxr-xr-x 5 root root 4.0K Jun 21 19:37 .. -rw-r--r-- 1 root root 102K Jul 22 22:15 andrewtesttooltwo -rw-r--r-- 1 root root 45 Oct 13 00:47 andrewtesttooltwo.tgz -rw-r--r-- 1 root root 8.3M Oct 13 03:20 mediaplaycounts.tgz -rw-r--r-- 1 root root 1.8G Oct 13 04:01 projanalysis.tgz -rw-r--r-- 1 root root 1.3M Oct 13 21:05 reportsbot.tgz drwxr-xr-x 2 root root 4.0K Oct 13 21:10 . -rw-r--r-- 1 root root 719K Oct 13 21:10 wsm.tgz -rw-r--r-- 1 root root 4.8K Oct 13 21:20 andrewtesttoolfour.tgz </syntaxhighlight> The actual deletion process is shockingly complicated. A tool will only be archived and deleted if all of the prior steps succeed, but disabling of a tool should be a sure thing. 
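When answering a maintainer who asks how long they have to restore a tool, the 40-day window described above can be turned into a concrete date. This is only a sketch: it assumes GNU <code>date</code> (as shipped on Debian), the function name <code>tool_archive_date</code> is hypothetical and not part of <code>mark_tool</code>, and the real archiving job may run somewhat later than the computed day.

```shell
# Hypothetical helper, not part of mark_tool: print the date 40 days after
# the day a tool was disabled (argument: YYYY-MM-DD).
# Works in UTC and in epoch seconds so DST changes cannot shift the result.
tool_archive_date() {
  disabled_epoch=$(date -u -d "$1" +%s)
  date -u -d "@$((disabled_epoch + 40 * 86400))" +%F
}

# A tool disabled on 2025-06-09 is due for archiving on 2025-07-19.
tool_archive_date 2025-06-09
```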
=== SSL certificates === See [[Portal:Toolforge/Admin/SSL_certificates]]. === Granting a tool write access to Elasticsearch === * Generate a random password and the mkpasswd crypt entry for it using the script [[phab:P4372|new-es-password.sh]]. (It must be run on a host with the <code>mkpasswd</code> command installed; <code>mkpasswd</code> is part of the whois Debian package.) <syntaxhighlight lang="shell-session"> $ ./new-es-password.sh tools.example tools.example elasticsearch.ini ---- [elasticsearch] user=tools.example password=A3rJqgFKxa/x4NlnIhmw2cXcV92it/Zv0Yt+a7yhxCw= ---- tools.example puppet master private (hieradata/labs/tools/common.yaml) ---- profile::toolforge::elasticsearch::haproxy::elastic_users: - name: 'tools.example' password: '$6$FYwP3wxT4K7O9EE$OA3P5972NWJVG/WUnD240sal34/dsNabbcawItevMYO9uoR.fJBrjSABex0EDW0wlkWHID1Tf4oJoiNvYFGmy/' </syntaxhighlight> * Add the private SHA512 hash to the [[Portal:Toolforge/Nodes#misc nodes|tools puppetserver]]: <syntaxhighlight lang="shell-session"> $ ssh tools-puppetserver-01.tools.eqiad1.wikimedia.cloud $ sudo -i # cd /srv/git/labs/private # vim hieradata/labs/tools/common.yaml ... merge the hiera data with the existing key... 
:wq # git add hieradata/labs/tools/common.yaml # git commit -m "[local] Elasticsearch credentials for $TOOL" </syntaxhighlight> * Force a puppet run on tools-elastic nodes using [[Cumin]] <syntaxhighlight lang="shell-session"> cloudcumin1001.eqiad.wmnet:~$ sudo cumin "O{project:tools name:.*elastic.*}" "run-puppet-agent" </syntaxhighlight> * Make the credentials available to the tool as [[Help:Toolforge/Envvars Service|envvars]]: <syntaxhighlight lang="shell-session"> $ ssh dev.toolforge.org $ sudo -i become example-tool $ toolforge envvars create TOOL_ELASTICSEARCH_USER Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort): <insert user> $ toolforge envvars create TOOL_ELASTICSEARCH_PASSWORD Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort): <insert password> </syntaxhighlight> '''Note:''' An older procedure placed the credentials in <code>/data/project/$TOOL/.elasticsearch.ini</code> instead. * Resolve the ticket! === Package upgrades === See [[Portal:Cloud_VPS/Admin/Managing_package_upgrades|Managing package upgrades]]. === Creating a new Docker image (e.g. for new versions of Node.js) === See [[Portal:Toolforge/Admin/Kubernetes#Docker_Images]] == Kubernetes == See [[Portal:Toolforge/Admin/Kubernetes]] == Build service == See [[Portal:Toolforge/Admin/Build_Service]] == Tools-mail / Exim == See [[Portal:Toolforge/Admin/Exim]] and [[Portal:Cloud_VPS/Admin/Email#Operations]] == Users and community == Some information about how to manage users and general community and their relationship with Toolforge. === Project membership request approval === User access requests show up in https://toolsadmin.wikimedia.org/tools/membership/ Some guidelines for account approvals, based on [[phab:T128158#2132893|advice from scfc]]: # If the request contains any defamatory or abusive information as part of the username(s), reason, or comments → mark as '''Declined''' and check the "Suppress this request (hide from non-admin users)" checkbox. 
#* You should also block the user on Wikitech and consider contacting a [[meta:Stewards|Steward]] for wider review of the SUL account. # If the user name "looks" like a bot or someone else who could not consent to the [[Wikitech:Cloud Services Terms of use|Terms of use]] and [[Help:Toolforge/Rules|Rules]] → mark as '''Declined'''. # Check the status of the associated [[meta:Help:Unified_login|SUL account]]. If the user is banned on one or more wikis → mark as '''Declined'''. # If the stated purpose is "tangible" ("I want to move my bot x to Toolforge", "I want to build a web app that does y", etc.) → mark as '''Approved'''. #* If you know that someone else has been working on the same problem, add a message explaining who the user should contact or where they might find more information. # If the stated purpose is "abstract" ("research", "experimentation", etc.) and there is a hackathon ongoing or planned, the user has a non-throw-away mail address, the user has created a user page with coherent information about themselves or linked a SUL account of good standing, etc. → mark as '''Approved'''. # Otherwise add a comment asking for clarification of their reason for use and mark as '''Feedback needed'''. The request is not really "denied", but more (indefinitely) "delayed". Requests left in '''Feedback needed''' for more than 30 days should usually be declined with a message like "Feel free to apply again later with more complete information." === Quota management === Toolforge quotas are managed via maintain-kubeusers. * Have the user open a Phabricator ticket, for the paper trail. 
See also [[Help:Toolforge/Kubernetes#Quotas_and_Resources]] * Send a patch for maintain-kubeusers, have it reviewed and merged: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/maintain-kubeusers/values/tools.yaml * Deploy in the cluster, using the deploy [[Portal:Cloud_VPS/Admin/Cookbooks]] == Other == === How do Toolforge web services actually work? === See [[Portal:Toolforge/Admin/Kubernetes#Ingress]] === What makes a root/Giving root access === See [[Portal:Toolforge/Admin/Toolforge_roots_and_Toolforge_admins|Toolforge roots and Toolforge admins]] === Servicegroup log === tools.admin runs <code>/data/project/admin/bin/toolhistory</code>, which provides an hourly snapshot of <code>ldaplist -l servicegroup</code> as git repository in <code>/data/project/admin/var/lib/git/servicegroups</code> === Useful administrative tools === These tools offer useful information about Toolforge itself: * [[toolforge:tool-db-usage/|ToolsDB]] - Statistics about tables owned by tools * [[toolforge:k8s-stats|k8s-stats]] - examine what our tools are doing * [[toolforge:openstack-browser/project/tools|OpenStack Browser]] - examine projects, instances, web proxies, and Puppet config == Brainstorming == * [[/BotLicensing]] == Sub pages == {{Special:Prefixindex/{{FULLPAGENAME}}/|hideredirects=1|stripprefix=1}} [[Category:Toolforge|Admin]] [[Category:Toolforge admin| ]] [[Category:Cloud Services admin|Toolforge]] rjgipij6786tp83cm043kpnl4srinp7 2309692 2309688 2025-06-09T10:06:07Z FNegri-WMF 32595 /* Granting a tool write access to Elasticsearch */ update output of "toolforge envvars create" (the output was changed in T359558) 2309692 wikitext text/x-wiki {{Toolforge nav_admin|nocat=1}} Documentation of backend components and '''admin procedures for Toolforge'''. See [[Help:Toolforge]] for user facing documentation about actually using Toolforge to run your bots and webservices. 
== Admin permissions == Performing admin procedures requires having admin permissions on Toolforge. There is not a single "admin" flag, but a set of interrelated permissions you can be granted. These are described in detail in the page [[Portal:Toolforge/Admin/Toolforge_roots_and_Toolforge_admins|Toolforge roots and Toolforge admins]]. == Failover == Tools should be able to survive the failure of any one virt* node. Some items may need manual failover === WebProxy === {{tracked|T283948}} The front web proxy is now a stateless web service. There are two <code>tools-proxy-N</code> VMs in the <code>tools</code> project, which previously ran [[Obsolete:Portal:Toolforge/Admin/Dynamicproxy|Dynamicproxy]] and nowadays just proxy everything to the [[Portal:Toolforge/Admin/Kubernetes/New cluster#front_proxy_%28haproxy%29|K8s HAProxies]]. The only meaningful thing that currently happens on them is the toolviews counting based on the access logs. Otherwise we could remove those nodes and just point to HAProxy. In case one VM is not working correctly, we can failover from one VM to the other, which can be done by manually reassigning the floating IP in Horizon or from the OpenStack CLI. {{Note|This is a different proxy from the [[Portal:Cloud VPS/Admin/Web proxy|Cloud VPS Web Proxy]].}} === Static webserver === This is a stateless simple nginx http server. Simply switch the floating IP from tools-static-10 to tools-static-11 (or vice versa) to switch over. Recovery is also equally trivial - just bring the machine back up and make sure puppet is ok. === Checker service === This is the service that Icinga hits to check status of several services. It's totally stateless. See [[Portal:Toolforge/Admin/Toolschecker]] === Redis === Redis uses [[Portal:Toolforge/Admin/Redis|Sentinel]] to automatically fail over in case of a node failure. === Prometheus === See [[Portal:Toolforge/Admin/Prometheus#Failover]]. 
=== Services === Service nodes run the Toolforge internal '''aptly''' service, to serve .deb packages as a repository for all the other nodes. == Command orchestration == Toolforge and Toolsbeta both have a local [[cumin]] server. == Administrative tasks == === Logging in as root === For normal login root access see [[Portal:Toolforge/Admin/Toolforge_roots_and_Toolforge_admins|Toolforge roots and Toolforge admins]]. In case the normal login does not work for example due to an LDAP failure, administrators can also directly log in as root. To prepare for that occasion, generate a separate key with <code>ssh-keygen</code>, add an entry to the <code>passwords::root::extra_keys</code> hash in Horizon's 'Project Puppet' section with your shell username as key and your public key as value and wait a Puppet cycle to have your key added to the <code>root</code> accounts. Add to your <code>~/.ssh/config</code>: <pre> # Use different identity for Tools root. Match host *.tools.eqiad1.wikimedia.cloud user root IdentityFile ~/.ssh/your_secret_root_key </pre> The code that reads <code>passwords::root::extra_keys</code> is in [https://phabricator.wikimedia.org/diffusion/LPRI/browse/master/modules/passwords/manifests/init.pp labs/private:modules/passwords/manifests/init.pp]. === Disabling all ssh logins except root === Useful for dealing with security critical situations. Just touch <code>/etc/nologin</code> and PAM will prevent any and all non-root logins. === Complaints of bastion being slow === {{Tracked|T266300}} Users are increasingly noticing slowness on tools-login due to either CPU or IOPS exhaustion caused by people running processes there instead of on Kubernetes. 
Here are some tips for finding the processes in need of killing: * Look for IOPS hogs ** <syntaxhighlight inline lang="shell-session">$ iotop</syntaxhighlight> * Look for abnormal processes: ** <syntaxhighlight inline lang="shell-session">$ ps axo user:32,pid,cmd | grep -Ev "^($USER|root|daemon|_lldpd|messagebus|nagios|nslcd|ntp|prometheus|statd|syslog|Debian-exim|www-data)" | grep -ivE 'screen|tmux|-bash|mosh-server|sshd:|/bin/bash|/bin/zsh'</syntaxhighlight> ** If you see <code>pyb.py</code> kill with extreme prejudice. * If the rogue job is running as a tool, <code>!log</code> something like: <code><nowiki>!log tools.$TOOL Killed $PROC process running on tools-bastion-NN. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Jobs_framework for instructions on running jobs on Kubernetes.</nowiki></code> === Local package management === Local packages are provided by an <code>aptly</code> repository on <code>tools-services-05</code>. On <code>tools-services-05</code>, you can manipulate the package database by various commands; cf. <code>aptly(1)</code>. Afterwards, you need to publish the database to the file <code>Packages</code> by (for the <code>trusty-tools</code> repository) <code>aptly publish --skip-signing update trusty-tools</code>. To use the packages on the clients you need to wait 30 minutes again or run <code>apt-get update</code>. In general, you should never just delete packages, but move them to <code>~tools.admin/archived-packages</code>. You can always see where a package is (would be) coming from with <code>apt-cache showpkg $package</code>. ==== Local package policy ==== '''Package repositories''' * We only install packages from trustworthy repositories. 
** OK are *** The official Debian and Ubuntu repositories, and *** Self-built packages (apt.wikimedia.org and aptly) ** Not OK are: *** PPAs *** other 3rd party repositories ''Packagers effectively get root on our systems, as they could add a rootkit to the package, or upload an unsafe sshd version, and apt-get will happily install it'' Hardness clause: in extraordinary cases, and for 'grandfathered in' packages, we can deviate from this policy, as long as security and maintainability are kept in mind. '''apt.wikimedia.org''' We assume that whatever is good for production is also OK for Toolforge. '''aptly''' We manage the aptly repository ourselves. * Packages in aptly need to be built by Toolforge admins ** we cannot import .deb files from untrusted 3rd party sources * Package source files need to come from a trusted source ** a source file from a trusted source (i.e. backports), or ** we build the debian source files ourselves ** we cannot build .dcs files from untrusted 3rd party sources * Packages need to be easy to update and build ** cowbuilder/pdebuild OK ** fpm is OK ** See [[Nova Resource:Tools/Admin/Deploy new jobutils package|Deploy new jobutils package]] for an example walk through of building and adding packages to aptly. * We only package if strictly necessary ** infrastructure packages ** packages that should be available for effective development (e.g. composer or sbt) ** not: python-*, lib*-perl, ..., which should just be installed with the available platform-specific package managers * For each package, it should be clear who is responsible for keeping it up to date ** for infrastructure packages, this should be one of the paid staffers A list of locally maintained packages can be found under [[Nova Resource:Tools/Admin/local packages|/local packages]]. 
==== Building packages ==== {{note| moved to [[Portal:Toolforge/Admin/Packaging]]}} ==== Deploy new misctools package ==== {{note | moved to [[Portal:Toolforge/Admin/Packaging]] }} ==== Testing/QA for a new tools-webservice package ==== See also [https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/tools-webservice/+/refs/heads/master/README tools-webservice source tree README]. There is a simple flask app in toolsbeta using the tool <code>test</code> that is set up to be deployed via webservice on Kubernetes. After running <code>become test</code>, you can go to the <code>qa/tools-webservice</code> directory. This is checked out via anonymous https, and is suitable for checking out a patch you are reviewing. There is an untracked file in there that is useful here, usually. The webservicefile at the route is just a copy of the one in the <code>scripts</code> folder in the repo. The only difference is: <syntaxhighlight lang=diff> 9d8 < sys.path.insert(0, '') </syntaxhighlight> That exchanges the distribution installed package in the python path for the local directory, so if you run <code>./webservice $somecommand</code> it will run what is in your local folder rather than what is in <code>/usr/lib/python3/dist-packages/</code>. If you are testing changes made directly to <code>scripts/webservice</code> in the repo, you will likely need to copy that over the file and add <code>sys.path.insert(0, "")</code> after the import sys line. If there is no <code>import sys</code> line in this version of the code, add one! This should let you bang on your new version without having to mess with packaging, yet. ==== Deploy new tools-webservice package ==== {{note | moved to [[Portal:Toolforge/Admin/Packaging]] }} === Webserver statistics === To get a look at webserver statistics, [//goaccess.io goaccess] is installed on the webproxies. 
Usage: <pre>goaccess --date-format="%d/%b/%Y" --log-format='%h - - [%d:%t %^] "%r" %s %b "%R" "%u"' -q -f/var/log/nginx/access.log</pre> Interactive key bindings are documented on [http://goaccess.io/man#interactive-keys the man page]. HTML output is supported by piping to a file. Note that nginx logs are rotated (twice?) daily, so there is only very recent data available. === Banning an IP from tool labs === On [[Hiera:Tools]], add the IP to the list of <code>dynamicproxy::banned_ips</code>, then force a puppet run on the webproxies. Add a note to [[Help:Toolforge/Banned]] explaining why. The user will get a message like [https://toolforge.org/.error/banned.html]. === Deploying the main web page === This website (plus the 403/500/503 error pages) is hosted under <code>tools.admin</code>. To deploy: <syntaxhighlight lang="shell-session"> $ become admin $ cd tool-admin-web $ git pull </syntaxhighlight> === Regenerate replica.my.cnf === {{See also|Portal:Data Services/Admin/Wiki Replicas#Account management (maintain-dbusers)}} This requires access to the cloudcontrol host [https://codesearch.wmcloud.org/search/?q=maintain_dbusers_primary&files=&excludeFiles=&repos= which is running maintain-dbusers], and can be done as follows: <syntaxhighlight lang="shell-session"> $ ssh cloudcontrolXXXX.eqiad.wmnet $ sudo /usr/local/sbin/maintain-dbusers delete tools.${NAME} --account-type=tool :# or $ sudo /usr/local/sbin/maintain-dbusers delete ${USERNAME} --account-type=user </syntaxhighlight> Once the account has been deleted, the maintain-dbusers service will automatically recreate the user account. ==== Debugging bad MariaDB credentials ==== {{anchor|Debugging bad mysql credentials}} Sometimes things go wrong and a user's <code>replica.my.cnf</code> credentials don't propagate everywhere. You can check the status on various servers to try and narrow down what went wrong.
The database credentials needed are in <code>/etc/dbusers.yaml</code> on the cloudcontrol host running maintain-dbusers. <syntaxhighlight lang="shell-session"> $ ssh cloudcontrolXXXX.eqiad.wmnet $ sudo cat /etc/dbusers.yaml :# look for the accounts-backend['password'] for the m5-master connections (user: labsdbaccounts) :# look for the labsdbs['password'] for the other connections (user: labsdbadmin) $ CHECK_UID=u12345 # User id to check for :# Check if the user is in our meta datastore $ mariadb -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM account WHERE mysql_username='${CHECK_UID}'\G" :# Check if all the accounts are created in the labsdb boxes from meta datastore. $ ACCT_ID=.... # Account_id is foreign key (id from account table) $ mariadb -h m5-master.eqiad.wmnet -u labsdbaccounts -p -e "USE labsdbaccounts; SELECT * FROM labsdbaccounts.account_host WHERE account_id=${ACCT_ID}\G" :# Check the actual labsdbs if needed (double quotes outside so that the shell expands ${CHECK_UID}) $ mariadb -h clouddbXXXX.eqiad.wmnet -u labsdbadmin -p -e "SELECT User, Password FROM mysql.user WHERE User LIKE '${CHECK_UID}';" :# Resynchronize account state on the replicas by finding missing GRANTS on each db server $ sudo maintain-dbusers harvest-replicas </syntaxhighlight> See [[phab:T183644]] for an example of fixing automatic credential creation caused when an old LDAP user becomes a Toolforge member and has an untracked user account on toolsdb. === Regenerate kubernetes credentials for tools (.kube/config) === With admin credentials (root on a control plane node will do), run <code>kubectl -n tool-<toolname> delete cm maintain-kubeusers-<toolname></code>; it should get regenerated within minutes.
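The kubectl step above can be wrapped in a small dry-run helper. This is a minimal sketch, not an existing Toolforge tool: the function name <code>regen_kube_creds_cmd</code> and the tool name <code>example</code> are hypothetical, and the helper only prints the command described above so it can be reviewed before being run with admin credentials.

```shell
# Hypothetical helper: print (do not run) the kubectl command that deletes a
# tool's maintain-kubeusers ConfigMap, as described in the text above.
regen_kube_creds_cmd() {
    local tool="$1"
    if [ -z "$tool" ]; then
        echo "usage: regen_kube_creds_cmd <toolname>" >&2
        return 1
    fi
    printf 'kubectl -n tool-%s delete cm maintain-kubeusers-%s\n' "$tool" "$tool"
}

# Example with a hypothetical tool called "example":
regen_kube_creds_cmd example
```

Printing the command instead of executing it keeps the destructive step explicit; paste the printed line into a root shell on a control plane node once the tool name has been checked.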
=== Adding K8S Components === See [[Portal:Toolforge/Admin/Kubernetes#Building_new_nodes]] === Deleting a tool === {{Tracked|T170355|Resolved}} For batch or CLI deletion of tools, use the 'mark_tool' command on a cloudcontrol node: [[File:Tool disable process.png|thumb|The awful truth about tool deletion]] <syntaxhighlight lang="shell-session"> andrew@cloudcontrol1003:~$ sudo mark_tool usage: mark_tool [-h] [--ldap-user LDAP_USER] [--ldap-password LDAP_PASSWORD] [--ldap-base-dn LDAP_BASE_DN] [--project PROJECT] [--disable] [--delete] [--enable] tool mark_tool: error: the following arguments are required: tool </syntaxhighlight> Maintainers can mark their tools for deletion using the "Disable tool" button on the tool's detail page on https://toolsadmin.wikimedia.org/. In either case, the immediate effect of disabling a tool is to stop any running jobs, prevent users from logging in as that tool, and schedule archiving and deletion for 40 days in the future. [[File:Tool restore process.png|thumb|A tool can be restored within 40 days of being disabled]] Tool archives are stored on the tools NFS server, currently <code>tools-nfs-2.tools.eqiad1.wikimedia.cloud</code>: <syntaxhighlight lang="shell-session"> root@labstore1004:/srv/disable-tool# ls -ltrah /srv/tools/archivedtools/ total 1.8G drwxr-xr-x 5 root root 4.0K Jun 21 19:37 .. -rw-r--r-- 1 root root 102K Jul 22 22:15 andrewtesttooltwo -rw-r--r-- 1 root root 45 Oct 13 00:47 andrewtesttooltwo.tgz -rw-r--r-- 1 root root 8.3M Oct 13 03:20 mediaplaycounts.tgz -rw-r--r-- 1 root root 1.8G Oct 13 04:01 projanalysis.tgz -rw-r--r-- 1 root root 1.3M Oct 13 21:05 reportsbot.tgz drwxr-xr-x 2 root root 4.0K Oct 13 21:10 . -rw-r--r-- 1 root root 719K Oct 13 21:10 wsm.tgz -rw-r--r-- 1 root root 4.8K Oct 13 21:20 andrewtesttoolfour.tgz </syntaxhighlight> The actual deletion process is shockingly complicated. A tool will only be archived and deleted if all of the prior steps succeed, but disabling of a tool should be a sure thing. 
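For the batch case mentioned at the start of this section, the commands can be generated first and reviewed before anything is run. The following is a minimal sketch under stated assumptions: the function name <code>print_disable_cmds</code> and the tool names are hypothetical, and only the <code>--disable</code> flag and the positional <code>tool</code> argument from the usage output above are used.

```shell
# Hypothetical dry-run helper: print one mark_tool command per tool name.
# It uses only the --disable flag and the positional tool argument shown in
# the mark_tool usage output; nothing is executed.
print_disable_cmds() {
    local tool
    for tool in "$@"; do
        printf 'sudo mark_tool --disable %s\n' "$tool"
    done
}

# Example with two made-up tool names:
print_disable_cmds exampletool1 exampletool2
```

The printed lines are meant to be run on a cloudcontrol node after review.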
=== SSL certificates === See [[Portal:Toolforge/Admin/SSL_certificates]]. === Granting a tool write access to Elasticsearch === * Generate a random password and the mkpasswd crypt entry for it using the script [[phab:P4372|new-es-password.sh]]. (Must be run on a host with the <code>mkpasswd</code> command installed; <code>mkpasswd</code> is part of the <code>whois</code> Debian package.) <syntaxhighlight lang="shell-session"> $ ./new-es-password.sh tools.example tools.example elasticsearch.ini ---- [elasticsearch] user=tools.example password=A3rJqgFKxa/x4NlnIhmw2cXcV92it/Zv0Yt+a7yhxCw= ---- tools.example puppet master private (hieradata/labs/tools/common.yaml) ---- profile::toolforge::elasticsearch::haproxy::elastic_users: - name: 'tools.example' password: '$6$FYwP3wxT4K7O9EE$OA3P5972NWJVG/WUnD240sal34/dsNabbcawItevMYO9uoR.fJBrjSABex0EDW0wlkWHID1Tf4oJoiNvYFGmy/' </syntaxhighlight> * Add the private SHA512 hash to the [[Portal:Toolforge/Nodes#misc nodes|tools puppetserver]]: <syntaxhighlight lang="shell-session"> $ ssh tools-puppetserver-01.tools.eqiad1.wikimedia.cloud $ sudo -i # cd /srv/git/labs/private # vim hieradata/labs/tools/common.yaml ... merge the hiera data with the existing key... 
:wq # git add hieradata/labs/tools/common.yaml # git commit -m "[local] Elasticsearch credentials for $TOOL" </syntaxhighlight> * Force a puppet run on tools-elastic nodes using [[Cumin]] <syntaxhighlight lang="shell-session"> cloudcumin1001.eqiad.wmnet:~$ sudo cumin "O{project:tools name:.*elastic.*}" "run-puppet-agent" </syntaxhighlight> * Make the credentials available to the tool as [[Help:Toolforge/Envvars Service|envvars]]: <syntaxhighlight lang="shell-session"> $ ssh dev.toolforge.org $ sudo -i become example-tool $ toolforge envvars create TOOL_ELASTICSEARCH_USER Enter the value of your envvar (Hit Ctrl+C to cancel): <insert user> $ toolforge envvars create TOOL_ELASTICSEARCH_PASSWORD Enter the value of your envvar (Hit Ctrl+C to cancel): <insert password> </syntaxhighlight> '''Note:''' An older procedure placed the credentials in <code>/data/project/$TOOL/.elasticsearch.ini</code> instead. * Resolve the ticket! === Package upgrades === See [[Portal:Cloud_VPS/Admin/Managing_package_upgrades|Managing package upgrades]]. === Creating a new Docker image (e.g. for new versions of Node.js) === See [[Portal:Toolforge/Admin/Kubernetes#Docker_Images]] == Kubernetes == See [[Portal:Toolforge/Admin/Kubernetes]] == Build service == See [[Portal:Toolforge/Admin/Build_Service]] == Tools-mail / Exim == See [[Portal:Toolforge/Admin/Exim]] and [[Portal:Cloud_VPS/Admin/Email#Operations]] == Users and community == Some information about how to manage users and general community and their relationship with Toolforge. === Project membership request approval === User access requests show up in https://toolsadmin.wikimedia.org/tools/membership/ Some guidelines for account approvals, based on [[phab:T128158#2132893|advice from scfc]]: # If the request contains any defamatory or abusive information as part of the username(s), reason, or comments → mark as '''Declined''' and check the "Suppress this request (hide from non-admin users)" checkbox. 
#* You should also block the user on Wikitech and consider contacting a [[meta:Stewards|Steward]] for wider review of the SUL account. # If the user name "looks" like a bot or someone else who could not consent to the [[Wikitech:Cloud Services Terms of use|Terms of use]] and [[Help:Toolforge/Rules|Rules]] → mark as '''Declined'''. # Check the status of the associated [[meta:Help:Unified_login|SUL account]]. If the user is banned on one or more wikis → mark as '''Declined'''. # If the stated purpose is "tangible" ("I want to move my bot x to Toolforge", "I want to build a web app that does y", etc.) → mark as '''Approved'''. #* If you know that someone else has been working on the same problem, add a message explaining who the user should contact or where they might find more information. # If the stated purpose is "abstract" ("research", "experimentation", etc.) and there is a hackathon ongoing or planned, the user has a non-throw-away mail address, the user has created a user page with coherent information about themselves or linked a SUL account of good standing, etc. → mark as '''Approved'''. # Otherwise add a comment asking for clarification of their reason for use and mark as '''Feedback needed'''. The request is not really "denied", but more (indefinitely) "delayed". Requests left in '''Feedback needed''' for more than 30 days should usually be declined with a message like "Feel free to apply again later with more complete information." === Quota management === Toolforge quotas are managed via maintain-kubeusers. * Have the user open a Phabricator ticket, for the paper trail.
See also [[Help:Toolforge/Kubernetes#Quotas_and_Resources]] * Send a patch for maintain-kubeusers, have it reviewed and merged: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/maintain-kubeusers/values/tools.yaml * Deploy in the cluster, using the deploy [[Portal:Cloud_VPS/Admin/Cookbooks]] == Other == === How do Toolforge web services actually work? === See [[Portal:Toolforge/Admin/Kubernetes#Ingress]] === What makes a root/Giving root access === See [[Portal:Toolforge/Admin/Toolforge_roots_and_Toolforge_admins|Toolforge roots and Toolforge admins]] === Servicegroup log === tools.admin runs <code>/data/project/admin/bin/toolhistory</code>, which provides an hourly snapshot of <code>ldaplist -l servicegroup</code> as git repository in <code>/data/project/admin/var/lib/git/servicegroups</code> === Useful administrative tools === These tools offer useful information about Toolforge itself: * [[toolforge:tool-db-usage/|ToolsDB]] - Statistics about tables owned by tools * [[toolforge:k8s-stats|k8s-stats]] - examine what our tools are doing * [[toolforge:openstack-browser/project/tools|OpenStack Browser]] - examine projects, instances, web proxies, and Puppet config == Brainstorming == * [[/BotLicensing]] == Sub pages == {{Special:Prefixindex/{{FULLPAGENAME}}/|hideredirects=1|stripprefix=1}} [[Category:Toolforge|Admin]] [[Category:Toolforge admin| ]] [[Category:Cloud Services admin|Toolforge]] qwmeqljhe269k0pcc5qdo87n06yozla Etcd 0 23934 2309718 2267543 2025-06-09T10:49:05Z FFurnari-WMF 36523 /* Example "hello world" query listing keys */ 2309718 wikitext text/x-wiki '''etcd''' is a distributed key/value store. 
* Overview: https://etcd.io * Source code: https://github.com/etcd-io/etcd * For documentation on etcd use from clients, see [[Etcd/Clients]] == Use at WMF == We currently have: # One cluster in eqiad for general use, running with https and client AUTH, part of the [[Etcd/Main cluster]] # One cluster in codfw for general use, running with https and client AUTH, part of the [[Etcd/Main cluster]] # One cluster on ganeti for kubernetes in eqiad, running with https; access is firewall-controlled. === Example "hello world" query listing keys === <syntaxhighlight lang="text"> razzi@cumin1002:~$ etcdctl -C https://conf1004.eqiad.wmnet:4001 ls /conftool/v1/pools/eqiad/ /conftool/v1/pools/eqiad/phabricator /conftool/v1/pools/eqiad/aqs ... </syntaxhighlight> == Operations == Note: There's no TLS for peer communications yet, so pay close attention to http vs https in the URLs and the port numbers used in various places. Another note: As of 2024-07 this process does not work at all on Debian Bookworm. === Bootstrapping an etcd cluster === ====== High level order of operations (after hosts are provisioned and in setup role): ====== # Deploy SRV DNS records described below # Prepare puppet patch to apply etcd profile, ensure cluster_bootstrap is initially true with <syntaxhighlight lang="yaml"> profile::etcd::v3::cluster_bootstrap: true </syntaxhighlight> # Merge puppet patch, run puppet on all new hosts # Prepare and deploy patch setting cluster_bootstrap false Before starting, there are a couple of things to keep in mind: * The suggested etcd version is 3.x; version 2.x is unsupported. Note that the version of etcd isn't related to the version of the protocol. That is, version 3 of etcd speaks both version 2 and 3 of the protocol. * The supported production method to bootstrap an etcd cluster is by using DNS SRV records for discovery. * Traffic between nodes of the etcd cluster is encrypted, so TLS Certificates will be needed for both of the above use cases. 
** These certificates can now be created automatically by the [[PKI/Clients|PKI]] puppet module. Configure the <code>use_pki_certs</code> parameter to be true when applying <code>profile::etcd::v3</code> in order to use this functionality. ** Prior to this method, [[Cergen]] was the tool used to create the keys and certificates. Please check other examples in the puppet private repo yaml configs. * Clients can be forced to use TLS client auth if needed, by adding the <code>::profile::etcd::v3::tlsproxy</code> profile to the cluster role's config. * The etcd cluster needs to be configured to allow nodes to know about each other. You will need to set the <code>profile::etcd::v3::discovery</code> hiera key to <code>dns:<SRV_RECORD_NAME></code>, which implies auto-discovery via DNS SRV records. For example, if you set <code>dns:k8s3.%{::site}.wmnet</code> then something like the following needs to be added to the DNS repo <syntaxhighlight lang="text"> # eqiad templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd1004.eqiad.wmnet. templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd1005.eqiad.wmnet. templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd1006.eqiad.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd1004.eqiad.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd1005.eqiad.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd1006.eqiad.wmnet. # codfw templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd2004.codfw.wmnet. templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd2005.codfw.wmnet. templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd2006.codfw.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd2004.codfw.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd2005.codfw.wmnet. 
templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd2006.codfw.wmnet. </syntaxhighlight> Let's now try to follow the procedure to bootstrap a new cluster made up of various etcd100x.example.com nodes. You can do the following: # Assign the profile <code>profile::etcd::v3</code> to your servers' roles; if using TLS client auth, also add <code>profile::etcd::v3::tlsproxy</code> # Define the following variables via hiera: <syntaxhighlight lang="yaml"># Name of the cluster. profile::etcd::v3::cluster_name: "<CLUSTER_NAME>" # Set to true when first building the cluster, it should be set to false if adding/removing members profile::etcd::v3::cluster_bootstrap: true # set this to "dns:<SRV_RECORD_NAME>" to use dns discovery profile::etcd::v3::discovery: "dns:pinkunicorn.%{site}.wmnet" # Set to true if you want to use client cert auth. Recommended: false. profile::etcd::v3::use_client_certs: false profile::etcd::v3::do_backup: false profile::etcd::v3::allow_from: "$DOMAIN_NETWORKS" # For the TLS proxy, you need the following variables too: # This cert is generated using puppet-ecdsacert, and includes # all the hostnames for the etcd machines in the SANs # Will need to be regenerated if we add servers to the cluster. profile::etcd::v3::tlsproxy::cert_name: "etcd.%{::domain}" profile::etcd::v3::tlsproxy::acls: { /: ["root"], /conftool: ["root", "conftool"], /eventlogging: []} # This should come from the private hieradata #profile::etcd::v3::tlsproxy::salt</syntaxhighlight> Now run puppet on one node, and it should bring up an etcd cluster. You can verify this with: <syntaxhighlight lang="bash"> $ etcdctl -C https://$(hostname -f):2379 cluster-health </syntaxhighlight> '''Now you can run puppet on the other nodes of the cluster and they should come up and be configured correctly.''' Once verified, flip the <code>profile::etcd::v3::cluster_bootstrap</code> hiera variable from 'true' to 'false', and add any further nodes via the procedure in the next section.
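The DNS SRV records shown earlier in this section follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal sketch, assuming the layout of that example (server records on port 2380, client records on port 2379, a 5M TTL, priority 0 and weight 1). The function name <code>etcd_srv_records</code> is hypothetical, and the <code>templates/wmnet:</code> prefix from the example is left out because it appears to name the file in the DNS repo rather than being part of the record.

```shell
# Hypothetical helper: print etcd discovery SRV records in the same layout as
# the example earlier in this section.
# Usage: etcd_srv_records <cluster_name> <host_fqdn>...
etcd_srv_records() {
    if [ "$#" -lt 2 ]; then
        echo "usage: etcd_srv_records <cluster_name> <host_fqdn>..." >&2
        return 1
    fi
    local cluster="$1"
    shift
    local host
    # Peer (server-to-server) records, port 2380.
    for host in "$@"; do
        printf '_etcd-server-ssl._tcp.%s 5M IN SRV 0 1 2380 %s.\n' "$cluster" "$host"
    done
    # Client records, port 2379.
    for host in "$@"; do
        printf '_etcd-client-ssl._tcp.%s 5M IN SRV 0 1 2379 %s.\n' "$cluster" "$host"
    done
}

# Example using the eqiad hosts from the records shown earlier:
etcd_srv_records k8s3 kubetcd1004.eqiad.wmnet kubetcd1005.eqiad.wmnet kubetcd1006.eqiad.wmnet
```

The output still has to be added to the DNS repo and deployed by hand; the helper only saves typing and keeps the server and client records in step with each other.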
=== Adding a new member to the cluster === Say we want to add a new server called <tt>etcd1YYY.example.com</tt> (we keep this example irrelevant to actual hosts on purpose) to our cluster. The steps are as follows: <ol> <li>Add the member via the members API, using the <tt>etcdctl</tt> tool against one of the already existing members, e.g. etcd1XXX. <syntaxhighlight lang="bash"> $ etcdctl --endpoints https://etcd1XXX.example.com:2379 member add etcd1YYY https://etcd1YYY.example.com:2380 Added member named etcd1YYY with ID 5f62a924ac85910 to cluster ETCD_NAME="etcd1YYY" # Next line is broken down artificially for ease of reading ETCD_INITIAL_CLUSTER="etcd1XXX=http://etcd1XXX.example.com:2380, etcd1YYY=http://etcd1YYY.example.com:2380" ETCD_INITIAL_CLUSTER_STATE="existing" </syntaxhighlight> Write down the output as it will be useful for our puppet changes.</li> <li>Assign the etcd role to the node in puppet.</li> <li>If you are using discovery SRV records (almost all the use cases, check the DNS repo for confirmation) you need to add a new record for the new hostname port 2380 and authdns-merge it, so when puppet runs you'll be able to see the new node joining the cluster.</li> <li>If not using discovery SRV records (which should be an edge case we don't have yet, consult with someone first), set the following variables for the whole cluster: <dl> <dd><code>profile::etcd::discovery</code> set to the value of <code>ETCD_INITIAL_CLUSTER</code> from the output of the etcdctl command before</dd> </dl></li> <li>Run puppet on the host. It should join the cluster. Confirm this is the case with the other hosts in the cluster as well (the logs should stop complaining about not reaching the new member)</li> <li>Finally, add the new server to the SRV records that clients consume.</li> <li>Make sure to restart navtiming (on webperf hosts) as it is a long running process and doesn't refresh etcd SRV records once it is started. 
</li> </ol> === Removing a member from the cluster === <ol> <li>Verify the node you want to remove is not the current leader, as removing the leader could cause trouble: <syntaxhighlight lang='bash'> $ curl -k -L https://etcd1001:2379/v2/stats/leader {"message":"not current leader"} </syntaxhighlight></li> <li>Remove the server from the clients SRV record</li> <li>Dynamically remove the server from the cluster: <syntaxhighlight lang="shell-session"> $ etcdctl -C https://conf1001.example.com:2379 member remove etcd1001 http://etcd1001.example.com:2380 $ etcdctl -C https://conf1001.example.com:2379 cluster-health </syntaxhighlight></li> <li>Remove the server from the cluster's SRV record if present, or from the hiera variable <tt>profile::etcd::discovery</tt> if not using SRV records</li> <li>Make sure to restart navtiming (on webperf hosts) as it is a long running process and doesn't refresh etcd SRV records once it is started. </li> </ol> === Recover a cluster after a disaster === In the sad case when Raft consensus is lost and there is no quorum anymore, the only way to recover the cluster is to restore the data from a backup; backups are taken every night into <code>/srv/backups/etcd</code>. The procedure to bring back the cluster is roughly as follows: * Stop all etcd instances that might be still running * Copy the backup to a new location and start etcd from there, listening on the public endpoints, with the <code>--force-new-cluster</code> option. It will start with peer urls bound to localhost. * Change the peer url of this server to what you'd expect it to be in normal situations * Add your other servers to the cluster, as follows: ** Verify the original etcd data are removed ** Add the server to the cluster logically with etcdctl ** Start etcd in order to join the cluster. 
As usual with etcd, the devil lies in the details of the command-line options; but there is a python script that, given the current cluster configuration, can generate the correct commands you'll have to enter into a shell. It can be found in the paste at [[phab:P3855|P3855]]. === Reimage nodes in a cluster === If you need to reimage nodes in a cluster, there are two strategies that you can follow: * Reimage one node at a time, while preserving the distributed log's data. This strategy works only if you remove/add the node via etcdctl after every reimage, since otherwise etcd will refuse to start on it (complaining about the Raft log being not up to date). This requires that the cluster is configured in status "existing" (and not "new"), via <code>profile::etcd::v3::cluster_bootstrap: false</code> * Reimage all nodes at once, hence not preserving the distributed log's data. This requires that the cluster is configured in status "new" (and not "existing"), via <code>profile::etcd::v3::cluster_bootstrap: true</code> Another idea could be to stop all the etcd daemons on all the nodes, and reimage one node at a time. This may work, but since we use ETCD_DISCOVERY_SRV etcd is likely going to contact the nodes in the cluster while bootstrapping (for example, to do leader election) ending up in connection failures. == See also == * [[config-master.wikimedia.org]] * [[Conftool]] * [[SLO/etcd main cluster]] {{Lowercase title}} [[Category:Services]] [[Category:Operations]] rmu6oxrs5rzayoes1ugnjwm07j3n229 2309720 2309718 2025-06-09T10:49:57Z FFurnari-WMF 36523 /* Example "hello world" query listing keys */ 2309720 wikitext text/x-wiki '''etcd''' is a distributed key/value store. 
* Overview: https://etcd.io
* Source code: https://github.com/etcd-io/etcd
* For documentation on etcd use from clients, see [[Etcd/Clients]]

== Use at WMF ==
We currently have:
# One cluster in eqiad for general use, running with https and client AUTH, part of the [[Etcd/Main cluster]]
# One cluster in codfw for general use, running with https and client AUTH, part of the [[Etcd/Main cluster]]
# One cluster on ganeti for kubernetes in eqiad, running with https; access is firewall-controlled.

=== Example "hello world" query listing keys ===
<syntaxhighlight lang="text">
razzi@cumin1002:~$ etcdctl -C https://conf1007.eqiad.wmnet:4001 ls /conftool/v1/pools/eqiad/
/conftool/v1/pools/eqiad/phabricator
/conftool/v1/pools/eqiad/aqs
...
</syntaxhighlight>

== Operations ==
Note: There's no TLS for peer communications yet, so pay close attention to http vs https in the URLs and the port numbers used in various places.

Another note: As of 2024-07 this process does not work at all on Debian Bookworm.

=== Bootstrapping an etcd cluster ===
====== High level order of operations (after hosts are provisioned and in setup role): ======
# Deploy the SRV DNS records described below
# Prepare a puppet patch to apply the etcd profile, ensuring cluster_bootstrap is initially true with <syntaxhighlight lang="yaml">
profile::etcd::v3::cluster_bootstrap: true
</syntaxhighlight>
# Merge the puppet patch and run puppet on all new hosts
# Prepare and deploy a patch setting cluster_bootstrap to false

Before starting, there are a couple of things to keep in mind:
* The suggested etcd version is 3.x; version 2.x is unsupported. Note that the version of etcd isn't related to the version of the protocol. That is, version 3 of etcd speaks both version 2 and 3 of the protocol.
* The supported production method to bootstrap an etcd cluster is by using DNS SRV records for discovery.
* Traffic between nodes of the etcd cluster is encrypted, so TLS Certificates will be needed for both of the above use cases.
** These certificates can now be created automatically by the [[PKI/Clients|PKI]] puppet module. Configure the <code>use_pki_certs</code> parameter to be true when applying <code>profile::etcd::v3</code> in order to use this functionality. ** Prior to this method, [[Cergen]] was the tool to use do to create the keys and certificates. Please check other examples in the puppet private repo yaml configs. * Clients can be forced to use TLS client auth if needed, by adding the <code>::profile::etcd::v3::tlsproxy</code> profile to the cluster role's config. * The etcd cluster needs to be configured to allow node to know about each other. You will need to set the <code>profile::etcd::v3::discovery</code> hiera key to <code>dns:<SRV_RECORD_NAME></code>, that implies auto-discovery via DNS SRV records. For example, if you set <code>dns:k8s3.%{::site}.wmnet</code> then something like the following needs to be added to the DNS repo <syntaxhighlight lang="text"> # eqiad templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd1004.eqiad.wmnet. templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd1005.eqiad.wmnet. templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd1006.eqiad.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd1004.eqiad.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd1005.eqiad.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd1006.eqiad.wmnet. # codfw templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd2004.codfw.wmnet. templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd2005.codfw.wmnet. templates/wmnet:_etcd-server-ssl._tcp.k8s3 5M IN SRV 0 1 2380 kubetcd2006.codfw.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd2004.codfw.wmnet. templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd2005.codfw.wmnet. 
templates/wmnet:_etcd-client-ssl._tcp.k8s3 5M IN SRV 0 1 2379 kubetcd2006.codfw.wmnet.
</syntaxhighlight>

Let's now follow the procedure to bootstrap a new cluster composed of various etcd100x.example.com nodes. You can do the following:
# Assign the profile <code>profile::etcd::v3</code> to your servers' roles; if using TLS client auth, also add <code>profile::etcd::v3::tlsproxy</code>
# Define the following variables via hiera: <syntaxhighlight lang="yaml"># Name of the cluster.
profile::etcd::v3::cluster_name: "<CLUSTER_NAME>"
# Set to true when first building the cluster, it should be set to false if adding/removing members
profile::etcd::v3::cluster_bootstrap: true
# set this to "dns:<SRV_RECORD_NAME>" to use dns discovery
profile::etcd::v3::discovery: "dns:pinkunicorn.%{site}.wmnet"
# Set to true if you want to use client cert auth. Recommended: false.
profile::etcd::v3::use_client_certs: false
profile::etcd::v3::do_backup: false
profile::etcd::v3::allow_from: "$DOMAIN_NETWORKS"
# For the TLS proxy, you need the following variables too:
# This cert is generated using puppet-ecdsacert, and includes
# all the hostnames for the etcd machines in the SANs
# Will need to be regenerated if we add servers to the cluster.
profile::etcd::v3::tlsproxy::cert_name: "etcd.%{::domain}"
profile::etcd::v3::tlsproxy::acls: { /: ["root"], /conftool: ["root", "conftool"], /eventlogging: []}
# This should come from the private hieradata
#profile::etcd::v3::tlsproxy::salt</syntaxhighlight>

Now run puppet on one node, and it should bring up an etcd cluster. You can verify this with:
<syntaxhighlight lang="bash">
$ etcdctl -C https://$(hostname -f):2379 cluster-health
</syntaxhighlight>

'''Now you can run puppet on the other nodes of the cluster and they should come up and be configured correctly.'''

Once verified, flip the <code>profile::etcd::v3::cluster_bootstrap</code> hiera variable from 'true' to 'false', and continue adding more nodes via the procedure in the next section.
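As an aside to the bootstrap steps: the SRV records needed for DNS discovery follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal sketch and not part of the DNS or Puppet tooling. The function name <code>srv_records</code> and its parameters are hypothetical, and the output simply mirrors the example records shown earlier in this section (peer port 2380, client port 2379, TTL 5M), without the <code>templates/wmnet:</code> file prefix.

```python
def srv_records(cluster, hosts, ttl="5M"):
    """Return SRV record lines for an etcd cluster.

    Hypothetical helper: it mirrors the layout of the example records in
    this section, emitting all server (peer) records on port 2380 first,
    then all client records on port 2379.
    """
    lines = []
    for service, port in (("_etcd-server-ssl", 2380), ("_etcd-client-ssl", 2379)):
        for host in hosts:
            # The trailing dot marks the target as a fully qualified name.
            lines.append(f"{service}._tcp.{cluster} {ttl} IN SRV 0 1 {port} {host}.")
    return lines


if __name__ == "__main__":
    example_hosts = ["kubetcd1004.eqiad.wmnet", "kubetcd1005.eqiad.wmnet"]
    for line in srv_records("k8s3", example_hosts):
        print(line)
```

The generated lines still need to be reviewed and committed to the DNS repo by hand; the sketch only removes the copy-and-paste step.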
=== Adding a new member to the cluster ===
Say we want to add a new server called <tt>etcd1YYY.example.com</tt> (we keep this example unrelated to actual hosts on purpose) to our cluster. The steps are as follows:
<ol>
<li>Add the member via the members API, using the <tt>etcdctl</tt> tool against one of the already existing members, e.g. etcd1XXX.
<syntaxhighlight lang="bash">
$ etcdctl --endpoints https://etcd1XXX.example.com:2379 member add etcd1YYY https://etcd1YYY.example.com:2380
Added member named etcd1YYY with ID 5f62a924ac85910 to cluster

ETCD_NAME="etcd1YYY"
# Next line is broken down artificially for ease of reading
ETCD_INITIAL_CLUSTER="etcd1XXX=http://etcd1XXX.example.com:2380,
etcd1YYY=http://etcd1YYY.example.com:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
</syntaxhighlight>
Write down the output as it will be useful for our puppet changes.</li>
<li>Assign the etcd role to the node in puppet.</li>
<li>If you are using discovery SRV records (almost all use cases; check the DNS repo for confirmation), you need to add a new record for the new hostname on port 2380 and authdns-merge it, so that when puppet runs you'll be able to see the new node joining the cluster.</li>
<li>If not using discovery SRV records (which should be an edge case we don't have yet; consult with someone first), set the following variables for the whole cluster:
<dl>
<dd><code>profile::etcd::discovery</code> set to the value of <code>ETCD_INITIAL_CLUSTER</code> from the output of the etcdctl command above</dd>
</dl></li>
<li>Run puppet on the host. It should join the cluster. Confirm this is the case with the other hosts in the cluster as well (the logs should stop complaining about not reaching the new member).</li>
<li>Finally, add the new server to the SRV records that clients consume.</li>
<li>Make sure to restart navtiming (on webperf hosts), as it is a long-running process and doesn't refresh etcd SRV records once it is started.
</li> </ol> === Removing a member from the cluster === <ol> <li>Verify the node you want to remove is not the current leader, that could run us into trouble: <syntaxhighlight lang='bash'> $ curl -k -L https://etcd1001:2379/v2/stats/leader {"message":"not current leader"} </syntaxhighlight></li> <li>Remove the server from the clients SRV record</li> <li>Dynamically remove the server from the cluster: <syntaxhighlight lang="shell-session"> $ etcdctl -C https://conf1001.example.com:2379 member remove etcd1001 http://etcd1001.example.com:2380 $ etcdctl -C https://conf1001.example.com:2379 cluster-health </syntaxhighlight></li> <li>Remove the server from the cluster's SRV record if present, or from the hiera variable <tt>profile::etcd::discovery</tt> if not using SRV records</li> <li>Make sure to restart navtiming (on webperf hosts) as it is a long running process and doesn't refresh etcd SRV records once it is started. </li> </ol> === Recover a cluster after a disaster === In the sad case when RAFT consensus is lost and there is no quorum anymore, the only way to recover the cluster is to recover the data from a backup, which are regularly performed every night in <code>/srv/backups/etcd</code>. The procedure to bring back the cluster is roughly as follows: * Stop all etcd instances that might be still running * Copy the backup to a new location, start etcd from there; the etcd server listening to the public endpoints with the --force-new-cluster option. It will start with peer urls bound to localhost. * Change the peer url of this server to what you'd expect it to be in normal situations * Add your other servers to the cluster, as follows: ** Verify the original etcd data are removed ** Add the server to the cluster logically with etcdctl ** Start etcd in order to join the cluster. 
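Whether the cluster has really lost quorum, as assumed at the start of this section, comes down to simple arithmetic: Raft needs a strict majority of the configured members to be reachable. The following is a minimal sketch of that check; the function names are hypothetical and it does not talk to etcd at all, it only encodes the majority rule.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a Raft majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members can still elect a leader and commit writes."""
    return healthy >= quorum_size(members)


if __name__ == "__main__":
    # A three-node cluster survives one failure but not two.
    print(has_quorum(members=3, healthy=2))
    print(has_quorum(members=3, healthy=1))
```

For a three-node cluster the majority is two, so losing a single node does not call for the recovery procedure above; losing two does.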
As usual with etcd, the devil lies in the details of the command-line options, but there is a Python script that, given the current cluster configuration, can generate the correct commands you'll have to enter into a shell. It can be found in the paste at [[phab:P3855|P3855]].

=== Reimage nodes in a cluster ===
If you need to reimage nodes in a cluster, there are two strategies that you can follow:
* Reimage one node at a time, while preserving the distributed log's data. This strategy works only if you remove/add the node via etcdctl after every reimage, since otherwise etcd will refuse to start on it (complaining about the Raft log not being up to date). This requires that the cluster is configured in status "existing" (and not "new"), via <code>profile::etcd::v3::cluster_bootstrap: false</code>
* Reimage all nodes at once, hence not preserving the distributed log's data. This requires that the cluster is configured in status "new" (and not "existing"), via <code>profile::etcd::v3::cluster_bootstrap: true</code>

Another idea could be to stop all the etcd daemons on all the nodes, and reimage one node at a time. This may work, but since we use ETCD_DISCOVERY_SRV, etcd is likely to contact the other nodes in the cluster while bootstrapping (for example, for leader election) and end up with connection failures.

== See also ==
* [[config-master.wikimedia.org]]
* [[Conftool]]
* [[SLO/etcd main cluster]]
{{Lowercase title}}
[[Category:Services]]
[[Category:Operations]]
b6rbiy2wa3img4lgfwv6r3gvi5yefla MariaDB/misc 0 199474 2309643 2273097 2025-06-09T06:17:17Z MArostegui (WMF) 8620 /* Current schemas */ 2309643 wikitext text/x-wiki
[[File:Wikimedia-relational-databases-2022.png|thumb|500px|Diagram of DB sections]]

== Misc information ==
There are 4 "miscellaneous" shards (m1, m2, m3 and m5), plus db_inventory:
* '''m1''': Internal services used by SRE Team ([[Bacula]], [[LibreNMS]]), and [[Etherpad]].
* '''m2''': [[VRT System|VRTS]], [[DebMonitor]], [[XHGui]], recommendations api, and others.
* '''m3''': [[Phabricator]], and legacy issue tracking systems.
* '''m5''': [[Mailman]], cxserverdb, Wikitech wiki, WMCS-related services ([[Toolsadmin.wikimedia.org]], [[Toolhub.wikimedia.org]]), and others.
* '''db_inventory''': [[Orchestrator]] (including dbtree backend) and [[Zarcillo]].

On the last cleanup, many unused databases were archived and/or deleted, and a contact person was identified for each of them.

== m1 ==
=== Current schemas ===
These are the current dbs, and what was needed to fail them over:
* '''bacula9''': The [[Bacula]] metadata database. We make sure there is no backup running at the time so we avoid backup failures. Currently we stop bacula-dir (may require disabling puppet to prevent it from automatically restarting) to make sure no new backups start and potentially fail, as temporarily stopping the director should not have any user impact. If backups are running, stopping the daemon will cancel the ongoing jobs. Consider rescheduling them (run) if they are important and time-sensitive; otherwise they will be scheduled at a later time automatically, following configuration. On bacula start, the bacula prometheus exporter can sometimes enter a race condition with the bacula daemon, so it might require a <code>systemctl restart prometheus-bacula-exporter.service</code> Owners: Jaime, backup: Alex
* '''cas''': Database to store 2FA tokens registered via Apereo CAS (idp.wikimedia.org): Owners: John Bond, Moritz https://phabricator.wikimedia.org/T268327
* '''dbbackups''': Database backups metadata; on master failover it needs a manual update as it doesn't use the proxy. Owners: Jaime. At the moment, it requires manual migration of connections after failover: https://gerrit.wikimedia.org/r/c/operations/puppet/+/668449
* '''etherpadlite:''' seems like etherpad-lite errors out and terminates after the migration. Normally systemd takes care of it and restarts it instantly.
However, if the maintenance window takes long enough, systemd will back off and stop trying to restart, in which case a <code>systemctl restart etherpad-lite</code> will be required. Etherpad crashes at least once a week anyway, so this is no big deal; tested by opening a pad. Owners: Alex. Killed idle db connection on failover.
* '''heartbeat''': Its writes should stop/start automatically when switching its puppet primary/replica config. Will need cleanup of old records after switch, for Orchestrator, see: [[MariaDB#Misc_section_failover_checklist_(example_with_m2)]] Owners: DBAs.
* '''pki''': Database to store signed certificates managed by pki.discovery.wmnet: Owners: John Bond, Moritz (https://phabricator.wikimedia.org/T268329). Sometimes it needs to be restarted: pki1001:~# systemctl restart cfssl-ocsprefresh-debmonitor.service
* '''librenms''': required manual kill of its connections <code>@netmon1001: apache reload</code> Owners: Netops (Arzhel). Killed idle db connection.
* '''rddmarc''': ?
* '''rt''': Old ticket manager, kept read-only for reference of contracts/orders, etc. Owners: Daniel, alex can help. Mostly used by RobH. Required manual kill of its connections; <code>@unobtinium: apache reload</code> Restarted apache2 on ununpentium to reset connections.
* '''zuul''': Needed by the new version of Zuul as part of the Zuul upgrade project. Zuul will reconnect automatically; no intervention should be required. If Zuul happens to be writing to it at the time, it will retry for about 15 seconds, and if that fails, it will give up and continue. That may mean that we end up with incomplete data in the database (so we may miss build result information). If that's critical, then users may need to manually retrigger a build. But otherwise, Zuul will continue and recover and future builds should be fine.

=== Deleted/archived schemas ===
* '''bacula''' old bacula database (for bacula 7.x).
Archived into the backups "archive pool" * '''blog''': to archive * '''bugzilla''': to archive * kill archived and dropped * '''bugzilla3''': idem kill archived and dropped * '''bugzilla4''': idem archive, actually, we also have this on dumps.wm.org https://dumps.wikimedia.org/other/bugzilla/ but that is the sanitized version, so keep this archive just in case i guess * '''bugzilla_testing''': idem kill archived and dropped * '''communicate''': ? archived and dropped * '''communicate_civicrm''': not fundraising! we're not sure what this is, we can check users table to determine who administered it archived and dropped * '''dashboard_production''': Puppet dashboard db. Never used it in my 3 years here, product sucks. Kill with fire. - alex archived and dropped * '''outreach_civicrm''': not fundraising, this is the contacts.wm thing, not used anymore, but in turn it means i dont know what "communicate" is then, we can look at the users tables for info on the * '''admin''': archived and dropped * '''outreach_drupal''': kill archived and dropped * '''percona''': jynus dropped * '''puppet''': required manual kill of its connections; This caused the most puppet spam. Either restart puppet-masters or kill connections **as soon** as the failover happens. Puppet no longer uses mysql, but its own postgres-backed storage. Was kept for a while for stats/observability. Owner: Alex * '''query_digests''': jynus archived and dropped * '''racktables''': Migrated to netbox, which uses Postgres. Finally removed. Owners: DC ops. jmm checked it after failover. went fine, no problems. * '''test''': archived and dropped * '''test_drupal''': er, kill with fire ? kill archived and dropped == m2 == === Current schemas === These are the current dbs, and what was needed to failover then: * '''otrs''': Normally requires restart of otrs-daemon, apache on ''mendelevium''. People: arnoldokoth, lsobanski * '''debmonitor''': Normally nothing is required. 
People: volans, moritz, simon **Django smoothly fails over without any manual intervention. **At most check <code>sudo tail -F /var/log/debmonitor/main.log</code> on the active Debmonitor host (<code>debmonitor1003</code> as of Feb. 2024). ***Some failed writes logged with <code>HTTP/1.1 500</code> and a stacktrace like <code>django.db.utils.OperationalError: (1290, 'The MariaDB server is running with the --read-only option so it cannot execute this statement')</code> are expected, followed by the resume of normal operations with most write operations logged as <code>HTTP/1.1 201</code>. **In case of issues it's safe to try a restart performing: <code>sudo systemctl restart debmonitor-server.service</code> * '''heartbeat''': Its writes should stop/start automatically when switching its puppet primary/replica config. Will need cleanup of old records after switch, for Orchestrator, see: [[MariaDB#Misc_section_failover_checklist_(example_with_m2)]] Owners: DBAs. * '''xhgui''': [[XHGui]], SRE Observability team * '''excimer''': [[Excimer UI]], SRE Observability team * '''recommendationapi''': k8s service, nothing required, should "just work". People: akosiaris, only user is Android application. * '''iegreview''': Shared nothing PHP application; should "just work". People: bd808, Niharika * '''scholarships''': Shared nothing PHP application; should "just work". People: bd808, Niharika * '''sockpuppet''': Sockpuppet detection service (also known as the similar-users service). PySpark model currently generates the CSV files and the application needs to be restarted to reload these files. Ideally the process that creates these files would simply update the database in-place. https://phabricator.wikimedia.org/T268505. People: Hnowlan * '''mwaddlink''': (https://phabricator.wikimedia.org/T267214 )The Link Recommendation Service is an application hosted on kubernetes with an API accessible via HTTP. 
It responds to a POST request containing wikitext of an article and responds with a structured response of link recommendations for the article. It does not have caching or storage; the client (MediaWiki) is responsible for doing that. MySQL table per wiki is used for caching the actual link recommendations (task T261411); each row contains serialized link recommendations for a particular article. https://wikitech.wikimedia.org/wiki/Add_Link . People: sgimeno dbproxies will need reload (''systemctl reload haproxy && echo "show stat" | socat /run/haproxy/haproxy.sock stdio''). You can check what's the active proxy by: host m2-master.eqiad.wmnet The passive can be checked by running ''grep -iR m2 hieradata/hosts/*'' on the puppet repo === Deleted/archived schemas === * testotrs: alex: kill it with ice and fire * testblog: archive it like blog * bugzilla_testing: archive it with the rest of bugzillas * reviewdb + reviewdb-test (deprecated & deleted): Gerrit: Normally needs a restart on ''gerrit1001'' just in case. People: akosiaris, hashar == m3 == === Current schemas === * '''phabricator_*''': 57 schemas to support phabricator itself * '''rt_migration''': schema needed for some crons related to phabricator jobs * '''bugzilla_migration''': schema needed for some crons related to phabricator jobs * '''heartbeat''': Its writes should stop/start automatically when switching its puppet primary/replica config. Will need cleanup of old records after switch, for Orchestrator, see: [[MariaDB#Misc_section_failover_checklist_(example_with_m2)]] Owners: DBAs. 
=== Dropped schemas === * fab_migration == m5 == === Current schemas === * '''striker''': schema for [[toolsadmin.wikimedia.org]] (Striker) * '''labsdbaccounts''' schema for [[Portal:Data_Services/Admin/Shared_storage#maintain-dbusers|maintain-dbusers]] (Toolforge) * ''' test_labsdbaccounts''' cloud team (not in use) https://phabricator.wikimedia.org/T255950#6260581 * '''cxserverdb''' Generate section mappings ([[phab:T306963|T306963]]). * '''idm and idm_staging''' ([[phab:T338008|T338008]]). * '''ipoid''' Generate section mappings ([[phab:T305114|T305114]]) * '''mailman3 and mailman3web''' Generate section mappings ([[phab:T278614|T278614]]) * '''heartbeat''': Its writes should stop/start automatically when switching its puppet primary/replica config. Will need cleanup of old records after switch, for Orchestrator, see: [[MariaDB#Misc_section_failover_checklist_(example_with_m2)]] Owners: DBAs. == db_inventory == * [[Orchestrator]] * [[Zarcillo]] * [[Category:MySQL]] mf34u663j3xd60vv4qximu8ptdcxc84 Data Platform/Data Lake/Edits/MediaWiki history 0 253925 2309683 2305032 2025-06-09T09:40:31Z GGoncalves-WMF 42848 Add a summary diagram of the pipeline and its main child datasets. 2309683 wikitext text/x-wiki This page describes the data set that stores the '''denormalized edit history''' of WMF's wikis. It lives in the [[Analytics/Systems/Cluster|Analytics Hadoop cluster]] and is accessible via the Hive table <code>wmf.mediawiki_history</code>. A new snapshot covering all of history is generated from the source data each month. The process is summarized in the diagram below; see [[Analytics/Systems/Data Lake/Edits/Pipeline]] for more details. <imagemap> File:Pipeline tree - MediaWiki History and main child datasets.jpg|thumb|center|800px|alt=MediaWiki History and its main child datasets - clickable image|MediaWiki History and main child datasets. '''Tip:''' Most of the boxes are clickable and link to their documentation. 
rect 4135 492 5215 1233 [[Help:Wiki_Replicas]] rect 2897 1572 4427 1988 [[Data_Platform/Systems/Edit_data_loading]] rect 961 2726 2329 3490 [[Data_Platform/Data_Lake/Edits]] rect 845 3697 2364 4126 [[Data_Platform/Systems/Page_and_user_history_reconstruction]] rect 2845 3442 4337 4378[[Data_Platform/Data_Lake/Edits/MediaWiki_history_dumps]] rect 4817 2319 5885 2783 [[https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Edits/Geoeditors#Editors_daily]] rect 4817 3001 5885 3437 [[Data_Platform/Data_Lake/Edits/Geoeditors]] rect 4817 3716 5885 4176 [[Data_Platform/Data_Lake/Edits/Edit_hourly]] rect 4817 4567 5940 4978 [[Data_Platform/Systems/Mediawiki_history_reduced_algorithm]] rect 6568 3001 7607 3437 [[Data_Platform/Data_Lake/Edits/Geoeditors/Public]] rect 6568 3716 7607 4176 [[Data_Platform/Data_Lake/Edits/Edit_hourly]] rect 6568 4436 7834 5102 [[Data_Platform/Data_Lake/Edits/Mediawiki_history_reduced]] </imagemap> == Public version == This data is published as a collection of files on our dumps infrastructure: [[Analytics/Data_Lake/Edits/Mediawiki_history_dumps]]. == Schema == For schema documentation, see the [https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf.mediawiki_history,PROD)/Schema entry in DataHub]. == Changes and known problems == {| class="wikitable" !Date !Phab Task !Snapshot version !Details |- |2023-11-01 |{{Phabricator|T350489}} |2023-10 |The mediawiki_project_namespace_map table schema was updated. The update was backwards-compatible but the code used the raw data, superimposing its own schema. This was the right decision for performance when we created the job, but latest Spark makes this unnecessary. The job should be updated to use a select statement and future-proof itself. This has not been prioritized. |- |2023-09-01 |{{Phabricator|T344632}} |2023-08 |A system user, "Global_rename_script", was given an id and caused a sizeable shift in data. The checker errors were ignored as false alarms. 
|- |2023-08-03 |{{Phabricator|T345208}} |2023-07 |Fixes to how redacted actor ids show up on Cloud replicas caused downstream problems in MW history. Skew-join helper logic was updated and jobs were rerun. The checker still flagged a sizable difference, probably due to deleted users no longer being seen as valid actors. It was decided that we should ignore this difference and not vet the data further. |- |2022-06-01 |{{Phabricator|T309987}} |2022-05 |Changes in the production database caused sqoop to break, delays in the mw history job, and delays for all dependent datasets. |- |2020-08 |{{Phabricator|T259823}} |2020-06 |Some page ids are null or zero, and other records appear as duplicates when attempting to use some seemingly unique column combinations |- |2019-07 |{{Phabricator|T221825}} |2019-05 |Schema changes: * Addition of <code>page_first_edit_timestamp</code> * Addition of <code>revision_is_from_before_page_creation</code> Improvements in linking more user and page events into full histories, that we were not able to put together before. Dataset should in general be more consistent and accurate. |- |2019-05 |{{Phabricator|T221824}} |2019-04 |Schema changes: * Addition of <code>event_user_is_bot_by_historical</code> and <code>event_user_is_bot_by</code> as well as <code>user_is_bot_by_historical</code> and <code>user_is_bot_by</code> * Addition of <code>event_user_creation_timestamp</code>, <code>event_user_first_timestamp</code> as well as <code>user_creation_timestamp</code>, <code>user_first_timestamp</code>. The user registration is the one stored in the user table, the user creation one is retrieved from the logging table (user creation event), and the first-edit is the date of the user first edit, whether deleted or not. 
* Removal ('''BREAKING''') <code>of event_user_is_bot_by_name</code> and <code>user_is_bot_name</code> (replaced by <code>is_bot_by</code> above) * Addition of <code>page_is_deleted</code> * Addition of <code>revision_deleted_parts</code> and <code>revision_deleted_parts_are_suppressed</code> * Rename of <code>revision_is_deleted</code> to <code>revision_is_deleted_by_page_deletion</code>, and <code>revision_deleted_timestamp</code> to <code>revision_deleted_by_page_deletion_timestamp.</code> * Addition of <code>revision_tags</code> Thanks to improvement made on user-history-reconstruction, linking between user-states and page/revision states is now a lot more accurate (see Task T218463). |- |2018-10 |{{Phabricator|T209031}} |2018-10 and 2018-11 |due to the refactor of mediawiki-comments into a separate table, the revision-comments are not available in the table for the two snapshots listed here. |- |2017-12 | |2017-11 |For pairs of fields that give current and historical versions of a value, rename the fields so that <code>_historical</code> is appended to the historical field rather than <code>_latest</code> to the current one. Revisions happening before page-creation date (due to restore over existing page) are now correctly linked. History of pages with complex delete/restore patterns is on purpose not yet orretly worked. Will happen after Wikistats-2 release. |- |2017-06 | {{Phabricator|T161147}} | 2017-06 |Provide cumulative edit count |- |2017-06 | {{Phabricator|T170493}} | 2017-06 |Use native timestamps (java.sql.Timestamp, but stillsaves them as JDBC compliant strings) |- |2016-10-06 | |n/a |The dataset contains data for simplewiki and enwiki until september 2016. Still we need to productionize the automatic updates to that table and import all the wikis. |- |2017-03-01 | |n/a |Add the <code>snapshot</code> partition, allowing to keep multiple versions of the history. Data starts to flow regularly (every month) from labs. 
|} [[Category:Edits data]] [[Category:Data platform]] 2cq9o8jow6k9gvu57fjdesn8hga3ecc 2309689 2309683 2025-06-09T09:55:29Z GGoncalves-WMF 42848 Fix diagram link to Editors daily. 2309689 wikitext text/x-wiki This page describes the data set that stores the '''denormalized edit history''' of WMF's wikis. It lives in the [[Analytics/Systems/Cluster|Analytics Hadoop cluster]] and is accessible via the Hive table <code>wmf.mediawiki_history</code>. A new snapshot covering all of history is generated from the source data each month. The process is summarized in the diagram below; see [[Analytics/Systems/Data Lake/Edits/Pipeline]] for more details. <imagemap> File:Pipeline tree - MediaWiki History and main child datasets.jpg|thumb|center|800px|alt=MediaWiki History and its main child datasets - clickable image|MediaWiki History and main child datasets. '''Tip:''' Most of the boxes are clickable and link to their documentation. rect 4135 492 5215 1233 [[Help:Wiki_Replicas]] rect 2897 1572 4427 1988 [[Data_Platform/Systems/Edit_data_loading]] rect 961 2726 2329 3490 [[Data_Platform/Data_Lake/Edits]] rect 845 3697 2364 4126 [[Data_Platform/Systems/Page_and_user_history_reconstruction]] rect 2845 3442 4337 4378[[Data_Platform/Data_Lake/Edits/MediaWiki_history_dumps]] rect 4817 2319 5885 2783 [[Data_Platform/Data_Lake/Edits/Geoeditors#Editors_daily]] rect 4817 3001 5885 3437 [[Data_Platform/Data_Lake/Edits/Geoeditors]] rect 4817 3716 5885 4176 [[Data_Platform/Data_Lake/Edits/Edit_hourly]] rect 4817 4567 5940 4978 [[Data_Platform/Systems/Mediawiki_history_reduced_algorithm]] rect 6568 3001 7607 3437 [[Data_Platform/Data_Lake/Edits/Geoeditors/Public]] rect 6568 3716 7607 4176 [[Data_Platform/Data_Lake/Edits/Edit_hourly]] rect 6568 4436 7834 5102 [[Data_Platform/Data_Lake/Edits/Mediawiki_history_reduced]] </imagemap> == Public version == This data is published as a collection of files on our dumps infrastructure: [[Analytics/Data_Lake/Edits/Mediawiki_history_dumps]]. 
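For the Hive table described in the introduction (as opposed to the public dumps), each monthly snapshot contains the whole history, so a query that does not filter on a single <code>snapshot</code> partition would count the same events once per snapshot. The following is a minimal sketch of building such a query. The function name is hypothetical, and the column names <code>wiki_db</code>, <code>event_entity</code> and <code>event_type</code> are assumptions that should be checked against the schema documentation before use.

```python
import re


def revision_count_query(snapshot: str, wiki_db: str) -> str:
    """Build a Hive query counting revision creations in one snapshot.

    Hypothetical helper. Column names other than `snapshot` are assumed
    and must be verified against the published schema.
    """
    # Validate the inputs so they can be interpolated safely.
    if not re.fullmatch(r"\d{4}-\d{2}", snapshot):
        raise ValueError(f"snapshot must look like YYYY-MM, got {snapshot!r}")
    if not re.fullmatch(r"[a-z0-9_]+", wiki_db):
        raise ValueError(f"unexpected wiki database name: {wiki_db!r}")
    return "\n".join([
        "SELECT COUNT(*) AS revisions",
        "FROM wmf.mediawiki_history",
        f"WHERE snapshot = '{snapshot}'",  # exactly one snapshot partition
        f"  AND wiki_db = '{wiki_db}'",
        "  AND event_entity = 'revision'",
        "  AND event_type = 'create'",
    ])


if __name__ == "__main__":
    print(revision_count_query("2023-10", "simplewiki"))
```

Pinning the snapshot also makes results reproducible: rerunning the same query after the next monthly snapshot lands reads the same partition.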
== Schema == For schema documentation, see the [https://datahub.wikimedia.org/dataset/urn:li:dataset:(urn:li:dataPlatform:hive,wmf.mediawiki_history,PROD)/Schema entry in DataHub]. == Changes and known problems == {| class="wikitable" !Date !Phab Task !Snapshot version !Details |- |2023-11-01 |{{Phabricator|T350489}} |2023-10 |The mediawiki_project_namespace_map table schema was updated. The update was backwards-compatible but the code used the raw data, superimposing its own schema. This was the right decision for performance when we created the job, but latest Spark makes this unnecessary. The job should be updated to use a select statement and future-proof itself. This has not been prioritized. |- |2023-09-01 |{{Phabricator|T344632}} |2023-08 |A system user, "Global_rename_script", was given an id and caused a sizeable shift in data. The checker errors were ignored as false alarms. |- |2023-08-03 |{{Phabricator|T345208}} |2023-07 |Fixes to how redacted actor ids show up on Cloud replicas caused downstream problems in MW history. Skew-join helper logic was updated and jobs were rerun. The checker still flagged a sizable difference, probably due to deleted users no longer being seen as valid actors. It was decided that we should ignore this difference and not vet the data further. |- |2022-06-01 |{{Phabricator|T309987}} |2022-05 |Changes in the production database caused sqoop to break, delays in the mw history job, and delays for all dependent datasets. |- |2020-08 |{{Phabricator|T259823}} |2020-06 |Some page ids are null or zero, and other records appear as duplicates when attempting to use some seemingly unique column combinations |- |2019-07 |{{Phabricator|T221825}} |2019-05 |Schema changes: * Addition of <code>page_first_edit_timestamp</code> * Addition of <code>revision_is_from_before_page_creation</code> Improvements in linking more user and page events into full histories, that we were not able to put together before. 
The dataset should in general be more consistent and accurate.
|-
|2019-05
|{{Phabricator|T221824}}
|2019-04
|Schema changes:
* Addition of <code>event_user_is_bot_by_historical</code> and <code>event_user_is_bot_by</code> as well as <code>user_is_bot_by_historical</code> and <code>user_is_bot_by</code>
* Addition of <code>event_user_creation_timestamp</code>, <code>event_user_first_timestamp</code> as well as <code>user_creation_timestamp</code>, <code>user_first_timestamp</code>. The user registration timestamp is the one stored in the user table, the user creation timestamp is retrieved from the logging table (user creation event), and the first timestamp is the date of the user's first edit, whether deleted or not.
* Removal ('''BREAKING''') of <code>event_user_is_bot_by_name</code> and <code>user_is_bot_name</code> (replaced by <code>is_bot_by</code> above)
* Addition of <code>page_is_deleted</code>
* Addition of <code>revision_deleted_parts</code> and <code>revision_deleted_parts_are_suppressed</code>
* Rename of <code>revision_is_deleted</code> to <code>revision_is_deleted_by_page_deletion</code>, and <code>revision_deleted_timestamp</code> to <code>revision_deleted_by_page_deletion_timestamp</code>.
* Addition of <code>revision_tags</code>
Thanks to improvements made to user-history reconstruction, linking between user states and page/revision states is now a lot more accurate (see Task T218463).
|-
|2018-10
|{{Phabricator|T209031}}
|2018-10 and 2018-11
|Due to the refactor of mediawiki-comments into a separate table, revision comments are not available in the table for the two snapshots listed here.
|-
|2017-12
|
|2017-11
|For pairs of fields that give current and historical versions of a value, rename the fields so that <code>_historical</code> is appended to the historical field rather than <code>_latest</code> to the current one. Revisions happening before the page-creation date (due to a restore over an existing page) are now correctly linked.
The history of pages with complex delete/restore patterns is deliberately not yet handled correctly; this will happen after the Wikistats-2 release.
|-
|2017-06
| {{Phabricator|T161147}}
| 2017-06
|Provide cumulative edit count
|-
|2017-06
| {{Phabricator|T170493}}
| 2017-06
|Use native timestamps (java.sql.Timestamp, but still saves them as JDBC-compliant strings)
|-
|2016-10-06
|
|n/a
|The dataset contains data for simplewiki and enwiki until September 2016. We still need to productionize the automatic updates to that table and import all the wikis.
|-
|2017-03-01
|
|n/a
|Add the <code>snapshot</code> partition, which allows keeping multiple versions of the history. Data starts to flow regularly (every month) from labs.
|}

[[Category:Edits data]]
[[Category:Data platform]]
knxizp253n8bq4op8o1tik2jbxhb54o Nova Resource:Tools.yifeibot/SAL 498 293634 2309626 2255678 2025-06-08T12:05:09Z Stashbot 7414 wmbot~multichill@tools-bastion-12: Checked for T395205 YiFeiBot and SignBot already use BotPasswords 2309626 wikitext text/x-wiki === 2025-06-08 === * 12:05 wmbot~multichill@tools-bastion-12: Checked for [[phab:T395205|T395205]] YiFeiBot and SignBot already use BotPasswords === 2024-12-18 === * 18:41 wmbot~anticomposite@tools-bastion-13: kubectl rollout restart deployment flr # bot not processing files === 2024-12-07 === * 19:26 wmbot~lucaswerkmeister@tools-bastion-13: kubectl rollout restart deployment flr # w.wiki/CLN5 === 2024-11-30 === * 21:40 wmbot~lucaswerkmeister@tools-bastion-13: kubectl rollout restart deployment flr # after reports of FlickreviewR 2 not working on IRC and COM:AN === 2024-08-26 === * 14:17 wmbot~lucaswerkmeister@tools-bastion-13: kubectl rollout restart deployment flr # after reports of FlickreviewR 2 not working on IRC [originally logged 14:11 UTC but stashbot was gone] === 2024-08-18 === * 15:56 wmbot~lucaswerkmeister@tools-bastion-13: kubectl rollout restart deployment flr # after reports of FlickreviewR 2 not working on IRC === 2024-05-31 === * 15:40 
wmbot~bd808@tools-bastion-12: `kubectl delete pod flr-6d74b958d9-bgkdw` after reports of FlickreviewR 2 not working on IRC === 2024-04-04 === * 22:03 wmbot~bd808@tools-sgebastion-10: `kubectl delete pod flr-6d74b958d9-4ztff` after reports of FlickreviewR 2 not working on IRC === 2024-03-13 === * 03:06 wmbot~bd808@tools-sgebastion-11: 'kubectl delete pod flr-6d74b958d9-pc6p9' after reports of FlickreviewR 2 not working on IRC === 2024-03-11 === * 15:47 wmbot~bd808@tools-sgebastion-10: 'kubectl delete pod flr-6d74b958d9-w7fhc' after reports of FlickreviewR 2 not working on IRC === 2024-02-18 === * 15:44 wmbot~taavi@tools-sgebastion-11: 'kubectl delete pod flr-6d74b958d9-b28dz' after reports of FlickreviewR 2 not working on IRC === 2023-02-10 === * 10:59 taavi: bump quotas per request in [[phab:T329350|T329350]] === 2022-06-04 === * 18:20 wm-bot: <multichill> Fixed Flickr bot by casting license[id] to string in /data/project/yifeibot/o/toolserver/bryan/flickr/shared/flickr.py === 2020-02-28 === * 19:10 wm-bot: <root> Migrated to 2020 Kubernetes cluster === 2016-11-30 === * 22:47 bd808: Deleted 2 jobs running on tools-exec-1210 for many hours/days ([[phab:T151980|T151980]]) <noinclude>[[Category:SAL]]</noinclude> iqg1wrzhg542qjkc3wgsg11663d75l5 Map of database maintenance 0 449160 2309636 2309618 2025-06-09T00:01:51Z Dexbot 30554 Bot: Updating the report 2309636 wikitext text/x-wiki {{/Header}} == Today (2025-06-09) == == Yesterday (2025-06-08) == == Last seven days == {| class="wikitable" |+ eqiad |- ! Section !! 
Work |- | es6 || [[phab:T395867|Switchover es6 master (es1037 -&gt; es1038) (T395867)]] (marostegui) |- | es7 || * [[phab:T395647|Migrate es7 to MariaDB 10.11 (T395647)]] (marostegui) * [[phab:T395982|Switchover es7 master (es1035 -&gt; es1039) (T395982)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s6 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s8 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} {| class="wikitable" |+ codfw |- ! Section !! 
Work |- | es6 || [[phab:T395420|Switchover es6 master (es2037 -&gt; es2035) (T395420)]] (marostegui) |- | es7 || * [[phab:T395771|Productionize es2047, es2048, es1047, es1048 (T395771)]] (marostegui) * [[phab:T395785|Switchover es7 master (es2038 -&gt; es2039) (T395785)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s4 || [[phab:T395241|Login (T395241)]] (fceratto) |- | s6 || * [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) * [[phab:T395989|Migrate s6 to MariaDB 10.11 (T395989)]] (marostegui) |- | s8 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} [[Category:MariaDB]] au614i8enjge0xw0a4c6ril72usoq6c 2309650 2309636 2025-06-09T07:35:44Z Dexbot 30554 Bot: Updating the report 2309650 wikitext text/x-wiki {{/Header}} == Today (2025-06-09) == {| class="wikitable" |+ codfw |- ! Section !! 
Work |- | s6 || [[phab:T396130|Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log (T396130)]] (marostegui) |- |} == Yesterday (2025-06-08) == == Last seven days == {| class="wikitable" |+ eqiad |- ! Section !! Work |- | es6 || [[phab:T395867|Switchover es6 master (es1037 -&gt; es1038) (T395867)]] (marostegui) |- | es7 || * [[phab:T395647|Migrate es7 to MariaDB 10.11 (T395647)]] (marostegui) * [[phab:T395982|Switchover es7 master (es1035 -&gt; es1039) (T395982)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s6 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s8 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} {| class="wikitable" |+ codfw |- ! Section !! 
Work |- | es6 || [[phab:T395420|Switchover es6 master (es2037 -&gt; es2035) (T395420)]] (marostegui) |- | es7 || * [[phab:T395771|Productionize es2047, es2048, es1047, es1048 (T395771)]] (marostegui) * [[phab:T395785|Switchover es7 master (es2038 -&gt; es2039) (T395785)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s4 || [[phab:T395241|Login (T395241)]] (fceratto) |- | s6 || * [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) * [[phab:T395989|Migrate s6 to MariaDB 10.11 (T395989)]] (marostegui) * [[phab:T396130|Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log (T396130)]] (marostegui) |- | s8 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} [[Category:MariaDB]] 6vo534q69iqzp2pudwz4dub4shk76uc 2309686 2309650 2025-06-09T09:44:59Z Dexbot 30554 Bot: Updating the report 2309686 wikitext text/x-wiki {{/Header}} == Today (2025-06-09) == {| class="wikitable" |+ eqiad |- ! Section !! 
Work |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- |} {| class="wikitable" |+ codfw |- ! Section !! Work |- | s6 || [[phab:T396130|Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log (T396130)]] (marostegui) |- |} == Yesterday (2025-06-08) == == Last seven days == {| class="wikitable" |+ eqiad |- ! Section !! Work |- | es6 || [[phab:T395867|Switchover es6 master (es1037 -&gt; es1038) (T395867)]] (marostegui) |- | es7 || * [[phab:T395647|Migrate es7 to MariaDB 10.11 (T395647)]] (marostegui) * [[phab:T395982|Switchover es7 master (es1035 -&gt; es1039) (T395982)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s6 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s8 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} {| class="wikitable" |+ codfw |- ! Section !! 
Work |- | es6 || [[phab:T395420|Switchover es6 master (es2037 -&gt; es2035) (T395420)]] (marostegui) |- | es7 || * [[phab:T395771|Productionize es2047, es2048, es1047, es1048 (T395771)]] (marostegui) * [[phab:T395785|Switchover es7 master (es2038 -&gt; es2039) (T395785)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s4 || [[phab:T395241|Login (T395241)]] (fceratto) |- | s6 || * [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) * [[phab:T395989|Migrate s6 to MariaDB 10.11 (T395989)]] (marostegui) * [[phab:T396130|Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log (T396130)]] (marostegui) |- | s8 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} [[Category:MariaDB]] rptczgjp2586z3siie2mcblxv910442 2309696 2309686 2025-06-09T10:09:02Z Dexbot 30554 Bot: Updating the report 2309696 wikitext text/x-wiki {{/Header}} == Today (2025-06-09) == {| class="wikitable" |+ eqiad |- ! Section !! 
Work |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- |} {| class="wikitable" |+ codfw |- ! Section !! Work |- | s6 || * [[phab:T395989|Migrate s6 to MariaDB 10.11 (T395989)]] (marostegui) * [[phab:T396130|Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log (T396130)]] (marostegui) |- |} == Yesterday (2025-06-08) == == Last seven days == {| class="wikitable" |+ eqiad |- ! Section !! Work |- | es6 || [[phab:T395867|Switchover es6 master (es1037 -&gt; es1038) (T395867)]] (marostegui) |- | es7 || * [[phab:T395647|Migrate es7 to MariaDB 10.11 (T395647)]] (marostegui) * [[phab:T395982|Switchover es7 master (es1035 -&gt; es1039) (T395982)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s6 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s8 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} {| class="wikitable" |+ codfw |- ! Section !! 
Work |- | es6 || [[phab:T395420|Switchover es6 master (es2037 -&gt; es2035) (T395420)]] (marostegui) |- | es7 || * [[phab:T395771|Productionize es2047, es2048, es1047, es1048 (T395771)]] (marostegui) * [[phab:T395785|Switchover es7 master (es2038 -&gt; es2039) (T395785)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s4 || [[phab:T395241|Login (T395241)]] (fceratto) |- | s6 || * [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) * [[phab:T395989|Migrate s6 to MariaDB 10.11 (T395989)]] (marostegui) * [[phab:T396130|Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log (T396130)]] (marostegui) |- | s8 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} [[Category:MariaDB]] ajgu5azigogrqw6uo2fjeayfx0rpi5d 2309704 2309696 2025-06-09T10:24:06Z Dexbot 30554 Bot: Updating the report 2309704 wikitext text/x-wiki {{/Header}} == Today (2025-06-09) == {| class="wikitable" |+ eqiad |- ! Section !! 
Work |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s4 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} {| class="wikitable" |+ codfw |- ! Section !! Work |- | s6 || * [[phab:T395989|Migrate s6 to MariaDB 10.11 (T395989)]] (marostegui) * [[phab:T396130|Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log (T396130)]] (marostegui) |- |} == Yesterday (2025-06-08) == == Last seven days == {| class="wikitable" |+ eqiad |- ! Section !! Work |- | es6 || [[phab:T395867|Switchover es6 master (es1037 -&gt; es1038) (T395867)]] (marostegui) |- | es7 || * [[phab:T395647|Migrate es7 to MariaDB 10.11 (T395647)]] (marostegui) * [[phab:T395982|Switchover es7 master (es1035 -&gt; es1039) (T395982)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s4 || [[phab:T395241|Login (T395241)]] (fceratto) |- | s6 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s8 || 
[[phab:T395241|Login (T395241)]] (fceratto) |- |} {| class="wikitable" |+ codfw |- ! Section !! Work |- | es6 || [[phab:T395420|Switchover es6 master (es2037 -&gt; es2035) (T395420)]] (marostegui) |- | es7 || * [[phab:T395771|Productionize es2047, es2048, es1047, es1048 (T395771)]] (marostegui) * [[phab:T395785|Switchover es7 master (es2038 -&gt; es2039) (T395785)]] (marostegui) |- | pc1 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc2 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc3 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc4 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc5 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc6 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc7 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | pc8 || [[phab:T395983|Migrate /srv/sqldata-cache directory in parsercache to /srv/sqldata (T395983)]] (marostegui) |- | s2 || [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) |- | s4 || [[phab:T395241|Login (T395241)]] (fceratto) |- | s6 || * [[phab:T383795|Move sX to STATEMENT based replication (T383795)]] (marostegui) * [[phab:T395989|Migrate s6 to MariaDB 10.11 (T395989)]] (marostegui) * [[phab:T396130|Add afl_ip_hex column and afl_var_dump_timestamp index to abuse_filter_log (T396130)]] (marostegui) |- | s8 || [[phab:T395241|Login (T395241)]] (fceratto) |- |} [[Category:MariaDB]] 01w1cfcgfrv44hcrrzlfazsl11o5wx6 Help:Toolforge/Envvars 12 452933 2309658 2302159 2025-06-09T08:22:58Z Taavi-WMF 41365 tweak recommended ways of 
inputting values 2309658 wikitext text/x-wiki
{{Toolforge nav}}
Toolforge tools can configure '''envvars''' ([[:w:Environment variable|environment variables]]) that are made available to your application at runtime. Both [[Help:Toolforge/Web|webservices]] and [[Help:Toolforge/Jobs|jobs]] running in [[Help:Toolforge/Kubernetes|Kubernetes]] are supported, whether built with the newer [[Help:Toolforge/Build_Service|build service]] or using one of the provided images.

The service is suitable for storing both configuration and secrets for your application, allowing workflows like having different config and secrets in your development/CI/production environments without changing a line of code. The envvar values are only available to the tool’s code and maintainers (though the <em>names</em> are publicly visible, e.g. in [[toolforge:k8s-status|k8s-status]]).

== Roadmap ==
=== Features that are already available ===
* Create/update, delete and list environment variables.
* Automatically inject environment variables into your application on startup (if you add or modify them, you'll have to restart your application).
* Some system envvars are provided (replicas user, replicas password, toolsdb user, toolsdb password, ...), all starting with <code>TOOL_</code>; see [https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/envvars-admission/values/tools.yaml?ref_type=heads#L5 the config for details].

=== Planned features that are not available yet ===
* Create a ticket with a feature request if you have any suggestions.

== Quickstart ==
=== Prerequisites ===
If you don't have a tool account yet, you need to create or join one. Detailed instructions are available at [[Help:Toolforge/Quickstart]].

You should have a webservice or job running or ready to run. See [[Help:Toolforge/Build Service]] on how to build and start a webservice or job.
Non-buildpack based webservices and jobs are also supported; for those, see [[Help:Toolforge/Quickstart#Host your first tool on a tool account]].

=== Creating a new environment variable ===
Once you have set up your Toolforge account, you can ssh to the bastion and become your tool (using the <code>wm-lol</code> tool as an example):
<syntaxhighlight lang="shell-session">
$ ssh myuser@login.toolforge.org
myuser@tools-sgebastion-10:~$ become wm-lol
tools.wm-lol@tools-sgebastion-10:~$
</syntaxhighlight>
Now you can run <code>toolforge envvars --help</code> to see the available commands and the syntax for each:
<syntaxhighlight lang="shell-session">
tools.wm-lol@tools-sgebastion-10:~$ toolforge envvars --help
Usage: toolforge-envvars [OPTIONS] COMMAND [ARGS]...

  Toolforge command line

Options:
  -v, --verbose  Show extra verbose output. NOTE: Do no rely on the format of the verbose output
  -d, --debug    show logs to debug the toolforge-envvars-* packages. For extra verbose output for say build or job, see --verbose
  --version      Show the version and exit.
  --help         Show this message and exit.

Commands:
  create  Create/update an envvar.
  delete  Delete an envvar.
  list    List all your envvars.
</syntaxhighlight>
For example, we can create a new environment variable. '''Note''' that the name of the envvar has to be all caps, has the same restrictions as sh/bash environment variable names, and can't overwrite the system <code>TOOL_*</code> variables.
There are many ways to do it:
<syntaxhighlight lang="shell-session">
tools.wm-lol@tools-sgebastion-10:~$ echo "from stdin" | toolforge envvars create TEST
name    value
TEST    from stdin
tools.wm-lol@tools-sgebastion-10:~$ echo "from file" > envvar_file
tools.wm-lol@tools-sgebastion-10:~$ toolforge envvars create TEST <envvar_file
name    value
TEST    from file
tools.wm-lol@tools-sgebastion-10:~$ toolforge envvars create TEST
Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort):
name    value
TEST    from prompt
tools.wm-lol@tools-sgebastion-10:~$ toolforge envvars create TEST "from argument"
name    value
TEST    from argument
</syntaxhighlight>
{{note|Never specify any secret values as command line arguments, as they will be exposed to other users of the same bastion host and are saved to your shell history file.|type=warn}}

Now we have to start or restart the running webservice so that it picks up the new environment variable (this example restarts a running buildservice-based webservice):
<syntaxhighlight lang="shell-session">
tools.wm-lol@tools-sgebastion-10:~$ toolforge webservice buildservice restart
</syntaxhighlight>
That's all! That environment variable is now available for the web service. :)

=== Updating an environment variable ===
Imagine that our API key has changed, and now we want to update it. The process is exactly the same: the <code>create</code> command will update the variable if it already exists:
<syntaxhighlight lang="shell-session">
tools.wm-lol@tools-sgebastion-10:~$ toolforge envvars list
name       value
API_KEY    my_api_key1
tools.wm-lol@tools-sgebastion-10:~$ toolforge envvars create API_KEY
Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort):
name       value
API_KEY    my_api_key2
</syntaxhighlight>
Remember to restart your running webservices/jobs if you want the new value to be picked up!
<syntaxhighlight lang="shell-session">
tools.wm-lol@tools-sgebastion-10:~$ toolforge webservice buildservice restart
</syntaxhighlight>

=== Globally set environment variables ===
A few environment variables are always set for you; currently you can find the list in [[gitlab:repos/cloud/toolforge/toolforge-deploy/-/blob/main/components/envvars-admission/values/tools.yaml]]. [[phab:T394408|T394408]] tracks a feature request to show them in <code>toolforge envvars list</code> output. All of them start with <code>TOOL_*</code>; some of them are:
* <code>TOOL_TOOLFORGE_API_URL</code>: URL of the Toolforge API (e.g. ''https:<nowiki/>//api.svc.tools.eqiad1.wikimedia.cloud:30003'').
* <code>TOOL_REDIS_URI</code>: URI of the shared Redis service (e.g. ''redis:<nowiki/>//redis.svc.tools.eqiad1.wikimedia.cloud:6379'').
* <code>TOOL_ELASTICSEARCH_URL</code>: URL of the Elasticsearch service (e.g. ''http:<nowiki/>//elasticsearch.svc.tools.eqiad1.wikimedia.cloud:80'').
* <code>TOOL_DATA_DIR</code>: Path to your tool's NFS mounted home directory. The <code>$HOME</code> envvar will not always match this path. Your application should rely on <code>$TOOL_DATA_DIR</code> instead of <code>$HOME</code> to locate shared files and directories.
There are some environment variables that will be set only once for you (you can overwrite them, but this is not recommended). Those are:
* <code>TOOL_TOOLSDB_USER</code>: tool specific user for toolsdb
* <code>TOOL_TOOLSDB_PASSWORD</code>: tool specific password for toolsdb
* <code>TOOL_REPLICA_USER</code>: tool specific user for the replica databases
* <code>TOOL_REPLICA_PASSWORD</code>: tool specific password for the replica databases

== Common problems and solutions ==
Please add to this section any issues you encountered and how you solved them.

=== How to add the contents of a file to an environment variable? 
===
Though it's not recommended, you may sometimes want to put the contents of a file in an environment variable (e.g. secret certificate keys). Use [https://www.gnu.org/software/bash/manual/html_node/Redirections.html#Redirecting-Input shell input redirection] to feed the file's contents to the standard input of the <code>toolforge envvars create</code> command:
<syntaxhighlight lang="shell-session">
tools.wm-lol@tools-sgebastion-10:~$ toolforge envvars create MY_CERT_KEY <secret.key
</syntaxhighlight>

=== How to know if my code is running inside toolforge? ===
Several environment variables are always set when running a job or a webservice; these all start with <code>TOOL_*</code>. One of them is <code>TOOL_TOOLFORGE_API_URL</code>, so if that variable exists, you can assume that you are running inside Toolforge.

== History ==
Below are some historical discussions that led to its current design and implementation.
* [[phab:T335979|Decision request Phabricator task]]
{{:Help:Cloud Services communication}}
kp2ptp3jf413hy6r7cjayot2w0nvugf Tool:Phab-ban/Log 116 453426 2309628 2309605 2025-06-08T14:53:22Z Phabbanbot 37210 DANISHAHMED111 was disabled by JJMC89 2309628 wikitext text/x-wiki <noinclude>'''Audit log of bans''' made via https://phab-ban.toolforge.org. Some bans made prior to 2023-09-01 were manually logged at [[phab:T200856]]. 
__NOTOC__</noinclude> === 2025-06-08 === * 14:53 [[phab:p/DANISHAHMED111|DANISHAHMED111]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2025-06-07 === * 05:50 [[phab:p/PCJND|PCJND]] was disabled by [[phab:p/Johannnes89/|Johannnes89]] === 2025-06-04 === * 08:37 [[phab:p/Alpasli|Alpasli]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-06-03 === * 02:07 [[phab:p/Jj881|Jj881]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2025-05-29 === * 05:51 [[phab:p/RodneyAraujo|RodneyAraujo]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-04-28 === * 15:07 [[phab:p/Hansmuller|Hansmuller]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-04-03 === * 15:57 [[phab:p/Wfan|Wfan]] was disabled by [[phab:p/Zabe/|Zabe]] === 2025-03-30 === * 10:15 [[phab:p/Watnoii24|Watnoii24]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-03-23 === * 11:27 [[phab:p/Saadtbli|Saadtbli]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-03-22 === * 16:45 [[phab:p/Stephonjeffries19|Stephonjeffries19]] was disabled by [[phab:p/LucasWerkmeister/|LucasWerkmeister]] * 04:32 [[phab:p/Chriswarriortv|Chriswarriortv]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-03-19 === * 11:29 [[phab:p/Vinay080|Vinay080]] was disabled by [[phab:p/zeljkofilipin/|zeljkofilipin]] === 2025-03-18 === * 12:04 [[phab:p/Walshandpartners777|Walshandpartners777]] was disabled by [[phab:p/Lucas_Werkmeister_WMDE/|Lucas_Werkmeister_WMDE]] === 2025-03-04 === * 01:33 [[phab:p/Porokhov|Porokhov]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-02-25 === * 17:27 [[phab:p/Selahaddin751|Selahaddin751]] was disabled by [[phab:p/brennen/|brennen]] === 2025-02-19 === * 01:00 [[phab:p/Mrb_Rafi|Mrb_Rafi]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-02-14 === * 19:19 [[phab:p/3652candy|3652candy]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 17:01 [[phab:p/Ataysaa|Ataysaa]] was disabled by [[phab:p/bd808/|bd808]] === 2025-02-09 === * 09:10 [[phab:p/BTullis|BTullis]] was 
disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2025-02-08 === * 23:26 [[phab:p/Alexdivkovic05|Alexdivkovic05]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-02-06 === * 06:19 [[phab:p/HormigasAIS|HormigasAIS]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-01-29 === * 07:40 [[phab:p/Denker61|Denker61]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2025-01-25 === * 21:36 [[phab:p/Khnthichith|Khnthichith]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-01-24 === * 11:33 [[phab:p/Aek191010|Aek191010]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-01-05 === * 15:40 [[phab:p/szsuperzuper|szsuperzuper]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2025-01-01 === * 09:08 [[phab:p/GALAXYENTERPRISES|GALAXYENTERPRISES]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-12-20 === * 00:52 [[phab:p/Mail.faluzes|Mail.faluzes]] was disabled by [[phab:p/Reedy/|Reedy]] === 2024-12-13 === * 02:01 [[phab:p/Gussdafii|Gussdafii]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-12-11 === * 08:54 [[phab:p/CodeTrailblazer|CodeTrailblazer]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 08:54 [[phab:p/SelvikIN|SelvikIN]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-12-03 === * 05:16 [[phab:p/Matkospajdr|Matkospajdr]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-12-01 === * 19:27 [[phab:p/Adarshsingh|Adarshsingh]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-11-28 === * 22:47 [[phab:p/Sandraklemma|Sandraklemma]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-11-23 === * 09:00 [[phab:p/Mahimabajpayee12|Mahimabajpayee12]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-11-11 === * 11:09 [[phab:p/Mvwservices|Mvwservices]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 07:00 [[phab:p/Impactolog|Impactolog]] was disabled by [[phab:p/revi/|revi]] === 2024-10-30 === * 09:05 [[phab:p/Jweighed1|Jweighed1]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-10-25 === * 04:20 
[[phab:p/Blunt2531|Blunt2531]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-10-08 === * 08:31 [[phab:p/Surfcityrecovery|Surfcityrecovery]] was disabled by [[phab:p/MoritzMuehlenhoff/|MoritzMuehlenhoff]] === 2024-10-01 === * 21:49 [[phab:p/T200856-01|T200856-01]] was disabled by [[phab:p/bd808/|bd808]] === 2024-09-27 === * 10:20 [[phab:p/SorBP|SorBP]] was disabled by [[phab:p/TheresNoTime/|TheresNoTime]] === 2024-09-08 === * 10:45 [[phab:p/Robin_Mathew_Rajan|Robin_Mathew_Rajan]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2024-09-02 === * 17:58 [[phab:p/Idxntcx|Idxntcx]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-09-01 === * 10:11 [[phab:p/LDAP|LDAP]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-07-22 === * 08:56 [[phab:p/Nobleadele|Nobleadele]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-06-18 === * 20:54 [[phab:p/Playgiirlkaybrazy|Playgiirlkaybrazy]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-06-08 === * 22:30 [[phab:p/Exposingsesion1|Exposingsesion1]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-05-27 === * 12:24 [[phab:p/JosefineHellrothLarssonWMSE|JosefineHellrothLarssonWMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 07:54 [[phab:p/SMMpanels|SMMpanels]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-05-22 === * 11:31 [[phab:p/Sandra_Fauconnier_WMSE|Sandra_Fauconnier_WMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:23 [[phab:p/MiaJacobssonWMSE|MiaJacobssonWMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:20 [[phab:p/David_Haskiya_WMSE|David_Haskiya_WMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:20 [[phab:p/kalle|kalle]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:20 [[phab:p/Tore_Danielsson_WMSE|Tore_Danielsson_WMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:19 [[phab:p/Gitta|Gitta]] 
was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:19 [[phab:p/annatroberg|annatroberg]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:19 [[phab:p/AxelPettersson_WMSE|AxelPettersson_WMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:17 [[phab:p/SaraMortsell|SaraMortsell]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] === 2024-05-13 === * 14:55 [[phab:p/BenoitPrieur|BenoitPrieur]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2024-05-04 === * 19:54 [[phab:p/Sammoon391|Sammoon391]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-05-01 === * 07:24 [[phab:p/Soubag|Soubag]] was disabled by [[phab:p/Mainframe98/|Mainframe98]] === 2024-04-28 === * 06:22 [[phab:p/Diamondscoin|Diamondscoin]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-04-19 === * 03:47 [[phab:p/Wawmart2|Wawmart2]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-04-08 === * 10:21 [[phab:p/Mardetanha|Mardetanha]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2024-03-29 === * 09:03 [[phab:p/Abdollmjjedloveanan|Abdollmjjedloveanan]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-03-12 === * 09:45 [[phab:p/Samantha78462|Samantha78462]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 09:39 [[phab:p/Samantha7861654654|Samantha7861654654]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 09:11 [[phab:p/Robin|Robin]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 09:05 [[phab:p/Anglinakuki|Anglinakuki]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 00:02 [[phab:p/Johnne25|Johnne25]] was disabled by [[phab:p/bd808/|bd808]] === 2024-03-07 === * 20:07 [[phab:p/Sami785|Sami785]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-03-06 === * 07:41 [[phab:p/28q|28q]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2024-03-02 === * 22:51 [[phab:p/kitchenstrategic|kitchenstrategic]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-02-23 === * 09:03 
[[phab:p/littleggghost|littleggghost]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-02-17 === * 05:10 [[phab:p/Skekeiei|Skekeiei]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-01-27 === * 03:50 [[phab:p/Andybitcoin|Andybitcoin]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-01-25 === * 09:10 [[phab:p/Mayo3030|Mayo3030]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-01-21 === * 05:59 [[phab:p/Hackear|Hackear]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 05:58 [[phab:p/joselopez45|joselopez45]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-01-20 === * 20:02 [[phab:p/08107130655|08107130655]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-01-18 === * 05:33 [[phab:p/Tecnologynew|Tecnologynew]] was disabled by [[phab:p/TheresNoTime/|TheresNoTime]] === 2024-01-12 === * 22:34 [[phab:p/cchen|cchen]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2024-01-11 === * 07:34 [[phab:p/Bernita43|Bernita43]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2024-01-07 === * 10:13 [[phab:p/Irademack|Irademack]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2023-12-31 === * 16:48 [[phab:p/Vieclamdmpt|Vieclamdmpt]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2023-12-26 === * 20:34 [[phab:p/Bgu5678|Bgu5678]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2023-11-26 === * 20:09 [[phab:p/Str13tlife|Str13tlife]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2023-11-24 === * 00:49 [[phab:p/Imambuchori03|Imambuchori03]] was disabled by [[phab:p/DannyS712/|DannyS712]] === 2023-11-22 === * 15:50 [[phab:p/Naleksuh|Naleksuh]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2023-11-19 === * 21:10 [[phab:p/Onack16888|Onack16888]] was disabled by [[phab:p/Daimona/|Daimona]] === 2023-11-15 === * 11:57 [[phab:p/Anonymous_ehacker|Anonymous_ehacker]] was disabled by [[phab:p/hashar/|hashar]] === 2023-11-09 === * 23:16 [[phab:p/dunicorn|dunicorn]] was disabled by [[phab:p/bd808/|bd808]] === 2023-09-30 === * 17:38 
[[phab:p/Wykirany|Wykirany]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2023-09-01 === * 16:29 [[phab:p/T200856-01|T200856-01]] was disabled by [[phab:p/bd808/|bd808]] * 16:17 [[phab:p/T200856-01|T200856-01]] was disabled by [[phab:p/bd808/|bd808]] dtrrk5qo4y9j022jlqdeltcypoki3dt 2309629 2309628 2025-06-08T14:59:16Z Phabbanbot 37210 Jaypam001 was disabled by JJMC89 2309629 wikitext text/x-wiki <noinclude>'''Audit log of bans''' made via https://phab-ban.toolforge.org. Some bans made prior to 2023-09-01 were manually logged at [[phab:T200856]]. __NOTOC__</noinclude> === 2025-06-08 === * 14:59 [[phab:p/Jaypam001|Jaypam001]] was disabled by [[phab:p/JJMC89/|JJMC89]] * 14:53 [[phab:p/DANISHAHMED111|DANISHAHMED111]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2025-06-07 === * 05:50 [[phab:p/PCJND|PCJND]] was disabled by [[phab:p/Johannnes89/|Johannnes89]] === 2025-06-04 === * 08:37 [[phab:p/Alpasli|Alpasli]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-06-03 === * 02:07 [[phab:p/Jj881|Jj881]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2025-05-29 === * 05:51 [[phab:p/RodneyAraujo|RodneyAraujo]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-04-28 === * 15:07 [[phab:p/Hansmuller|Hansmuller]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-04-03 === * 15:57 [[phab:p/Wfan|Wfan]] was disabled by [[phab:p/Zabe/|Zabe]] === 2025-03-30 === * 10:15 [[phab:p/Watnoii24|Watnoii24]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-03-23 === * 11:27 [[phab:p/Saadtbli|Saadtbli]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-03-22 === * 16:45 [[phab:p/Stephonjeffries19|Stephonjeffries19]] was disabled by [[phab:p/LucasWerkmeister/|LucasWerkmeister]] * 04:32 [[phab:p/Chriswarriortv|Chriswarriortv]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-03-19 === * 11:29 [[phab:p/Vinay080|Vinay080]] was disabled by [[phab:p/zeljkofilipin/|zeljkofilipin]] === 2025-03-18 === * 12:04 
[[phab:p/Walshandpartners777|Walshandpartners777]] was disabled by [[phab:p/Lucas_Werkmeister_WMDE/|Lucas_Werkmeister_WMDE]] === 2025-03-04 === * 01:33 [[phab:p/Porokhov|Porokhov]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-02-25 === * 17:27 [[phab:p/Selahaddin751|Selahaddin751]] was disabled by [[phab:p/brennen/|brennen]] === 2025-02-19 === * 01:00 [[phab:p/Mrb_Rafi|Mrb_Rafi]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2025-02-14 === * 19:19 [[phab:p/3652candy|3652candy]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 17:01 [[phab:p/Ataysaa|Ataysaa]] was disabled by [[phab:p/bd808/|bd808]] === 2025-02-09 === * 09:10 [[phab:p/BTullis|BTullis]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2025-02-08 === * 23:26 [[phab:p/Alexdivkovic05|Alexdivkovic05]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-02-06 === * 06:19 [[phab:p/HormigasAIS|HormigasAIS]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-01-29 === * 07:40 [[phab:p/Denker61|Denker61]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2025-01-25 === * 21:36 [[phab:p/Khnthichith|Khnthichith]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-01-24 === * 11:33 [[phab:p/Aek191010|Aek191010]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2025-01-05 === * 15:40 [[phab:p/szsuperzuper|szsuperzuper]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2025-01-01 === * 09:08 [[phab:p/GALAXYENTERPRISES|GALAXYENTERPRISES]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-12-20 === * 00:52 [[phab:p/Mail.faluzes|Mail.faluzes]] was disabled by [[phab:p/Reedy/|Reedy]] === 2024-12-13 === * 02:01 [[phab:p/Gussdafii|Gussdafii]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-12-11 === * 08:54 [[phab:p/CodeTrailblazer|CodeTrailblazer]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 08:54 [[phab:p/SelvikIN|SelvikIN]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-12-03 === * 05:16 [[phab:p/Matkospajdr|Matkospajdr]] was disabled by 
[[phab:p/Peachey88/|Peachey88]] === 2024-12-01 === * 19:27 [[phab:p/Adarshsingh|Adarshsingh]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-11-28 === * 22:47 [[phab:p/Sandraklemma|Sandraklemma]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-11-23 === * 09:00 [[phab:p/Mahimabajpayee12|Mahimabajpayee12]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-11-11 === * 11:09 [[phab:p/Mvwservices|Mvwservices]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 07:00 [[phab:p/Impactolog|Impactolog]] was disabled by [[phab:p/revi/|revi]] === 2024-10-30 === * 09:05 [[phab:p/Jweighed1|Jweighed1]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-10-25 === * 04:20 [[phab:p/Blunt2531|Blunt2531]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-10-08 === * 08:31 [[phab:p/Surfcityrecovery|Surfcityrecovery]] was disabled by [[phab:p/MoritzMuehlenhoff/|MoritzMuehlenhoff]] === 2024-10-01 === * 21:49 [[phab:p/T200856-01|T200856-01]] was disabled by [[phab:p/bd808/|bd808]] === 2024-09-27 === * 10:20 [[phab:p/SorBP|SorBP]] was disabled by [[phab:p/TheresNoTime/|TheresNoTime]] === 2024-09-08 === * 10:45 [[phab:p/Robin_Mathew_Rajan|Robin_Mathew_Rajan]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2024-09-02 === * 17:58 [[phab:p/Idxntcx|Idxntcx]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-09-01 === * 10:11 [[phab:p/LDAP|LDAP]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-07-22 === * 08:56 [[phab:p/Nobleadele|Nobleadele]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-06-18 === * 20:54 [[phab:p/Playgiirlkaybrazy|Playgiirlkaybrazy]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-06-08 === * 22:30 [[phab:p/Exposingsesion1|Exposingsesion1]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-05-27 === * 12:24 [[phab:p/JosefineHellrothLarssonWMSE|JosefineHellrothLarssonWMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 07:54 [[phab:p/SMMpanels|SMMpanels]] was disabled by [[phab:p/JJMC89/|JJMC89]] 
=== 2024-05-22 === * 11:31 [[phab:p/Sandra_Fauconnier_WMSE|Sandra_Fauconnier_WMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:23 [[phab:p/MiaJacobssonWMSE|MiaJacobssonWMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:20 [[phab:p/David_Haskiya_WMSE|David_Haskiya_WMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:20 [[phab:p/kalle|kalle]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:20 [[phab:p/Tore_Danielsson_WMSE|Tore_Danielsson_WMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:19 [[phab:p/Gitta|Gitta]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:19 [[phab:p/annatroberg|annatroberg]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:19 [[phab:p/AxelPettersson_WMSE|AxelPettersson_WMSE]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] * 10:17 [[phab:p/SaraMortsell|SaraMortsell]] was disabled by [[phab:p/Sebastian_Berlin-WMSE/|Sebastian_Berlin-WMSE]] === 2024-05-13 === * 14:55 [[phab:p/BenoitPrieur|BenoitPrieur]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2024-05-04 === * 19:54 [[phab:p/Sammoon391|Sammoon391]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-05-01 === * 07:24 [[phab:p/Soubag|Soubag]] was disabled by [[phab:p/Mainframe98/|Mainframe98]] === 2024-04-28 === * 06:22 [[phab:p/Diamondscoin|Diamondscoin]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-04-19 === * 03:47 [[phab:p/Wawmart2|Wawmart2]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-04-08 === * 10:21 [[phab:p/Mardetanha|Mardetanha]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2024-03-29 === * 09:03 [[phab:p/Abdollmjjedloveanan|Abdollmjjedloveanan]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-03-12 === * 09:45 [[phab:p/Samantha78462|Samantha78462]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 09:39 
[[phab:p/Samantha7861654654|Samantha7861654654]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 09:11 [[phab:p/Robin|Robin]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 09:05 [[phab:p/Anglinakuki|Anglinakuki]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 00:02 [[phab:p/Johnne25|Johnne25]] was disabled by [[phab:p/bd808/|bd808]] === 2024-03-07 === * 20:07 [[phab:p/Sami785|Sami785]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-03-06 === * 07:41 [[phab:p/28q|28q]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2024-03-02 === * 22:51 [[phab:p/kitchenstrategic|kitchenstrategic]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-02-23 === * 09:03 [[phab:p/littleggghost|littleggghost]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-02-17 === * 05:10 [[phab:p/Skekeiei|Skekeiei]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-01-27 === * 03:50 [[phab:p/Andybitcoin|Andybitcoin]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2024-01-25 === * 09:10 [[phab:p/Mayo3030|Mayo3030]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-01-21 === * 05:59 [[phab:p/Hackear|Hackear]] was disabled by [[phab:p/Peachey88/|Peachey88]] * 05:58 [[phab:p/joselopez45|joselopez45]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-01-20 === * 20:02 [[phab:p/08107130655|08107130655]] was disabled by [[phab:p/Peachey88/|Peachey88]] === 2024-01-18 === * 05:33 [[phab:p/Tecnologynew|Tecnologynew]] was disabled by [[phab:p/TheresNoTime/|TheresNoTime]] === 2024-01-12 === * 22:34 [[phab:p/cchen|cchen]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2024-01-11 === * 07:34 [[phab:p/Bernita43|Bernita43]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2024-01-07 === * 10:13 [[phab:p/Irademack|Irademack]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2023-12-31 === * 16:48 [[phab:p/Vieclamdmpt|Vieclamdmpt]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2023-12-26 === * 20:34 [[phab:p/Bgu5678|Bgu5678]] was disabled by 
[[phab:p/Peachey88/|Peachey88]] === 2023-11-26 === * 20:09 [[phab:p/Str13tlife|Str13tlife]] was disabled by [[phab:p/JJMC89/|JJMC89]] === 2023-11-24 === * 00:49 [[phab:p/Imambuchori03|Imambuchori03]] was disabled by [[phab:p/DannyS712/|DannyS712]] === 2023-11-22 === * 15:50 [[phab:p/Naleksuh|Naleksuh]] was disabled by [[phab:p/WMFOffice/|WMFOffice]] === 2023-11-19 === * 21:10 [[phab:p/Onack16888|Onack16888]] was disabled by [[phab:p/Daimona/|Daimona]] === 2023-11-15 === * 11:57 [[phab:p/Anonymous_ehacker|Anonymous_ehacker]] was disabled by [[phab:p/hashar/|hashar]] === 2023-11-09 === * 23:16 [[phab:p/dunicorn|dunicorn]] was disabled by [[phab:p/bd808/|bd808]] === 2023-09-30 === * 17:38 [[phab:p/Wykirany|Wykirany]] was disabled by [[phab:p/RhinosF1/|RhinosF1]] === 2023-09-01 === * 16:29 [[phab:p/T200856-01|T200856-01]] was disabled by [[phab:p/bd808/|bd808]] * 16:17 [[phab:p/T200856-01|T200856-01]] was disabled by [[phab:p/bd808/|bd808]] 508ibtowu2yhnnw779fj10orlin0kqd Help:Toolforge/Running Pywikibot scripts 12 453743 2309659 2159014 2025-06-09T08:23:38Z Taavi-WMF 41365 tweak recommended ways of inputting envvars 2309659 wikitext text/x-wiki {{Toolforge nav}} The '''[[:mw:Special:MyLanguage/Manual:Pywikibot|Pywikibot]]''' framework is a Python library and collection of scripts to automate work on MediaWiki sites. This tutorial contains instructions on how to run '''[[:mw:Special:MyLanguage/Manual:Pywikibot/Scripts|Pywikibot's built-in scripts]]''' on Toolforge using the [[Help:Toolforge/Jobs framework|Toolforge jobs framework]]. {{Note|If you want to run a script that is not included with Pywikibot itself, follow the [[Help:Toolforge/Running Pywikibot scripts (advanced)|advanced Pywikibot tutorial]].}} == Prerequisites == To effectively use Pywikibot on Toolforge, you need: * Access to a tool account on Toolforge. See [[Help:Toolforge/Quickstart]] to learn how to set it up and use. 
Note that to use Toolforge you need a beginner-level understanding of the Linux terminal, SSH (see [[wikibooks:Internet Technologies/SSH|Internet Technologies/SSH]]), and Bash (see [[wikibooks:Bash Shell Scripting|Bash Shell Scripting]]).
* Basic familiarity with Python and Pywikibot. You don't need to know how to program in Python, but it helps to understand how to run Python scripts. To learn Python, see [[wikibooks:Non-Programmer's Tutorial for Python 3|Non-Programmer's Tutorial for Python 3]] or [https://docs.python.org/3/tutorial/ The Python Tutorial in Python documentation]. For information about Pywikibot, see [[mw:Special:MyLanguage/Manual:Pywikibot|Manual:Pywikibot]].

== Setup ==
The provided Pywikibot image uses [[mw:Manual:Pywikibot/OAuth/Wikimedia|OAuth]] for authenticating with Wikimedia wikis. To set it up:
# Set up an owner-only OAuth 1.0a credential following the instructions at [[mw:Manual:Pywikibot/OAuth/Wikimedia#Registering your bot with the wiki software]]. The required dependencies are already installed and the configuration is included in the image.
# On Toolforge, create [[Help:Toolforge/Envvars Service|environment variables]] that contain the bot username and the four values you got when creating the OAuth credential:<syntaxhighlight lang="shell-session">
tools.mytool@tools-sgebastion-10:~$ toolforge envvars create PWB_USERNAME
tools.mytool@tools-sgebastion-10:~$ toolforge envvars create PWB_CONSUMER_TOKEN
tools.mytool@tools-sgebastion-10:~$ toolforge envvars create PWB_CONSUMER_SECRET
tools.mytool@tools-sgebastion-10:~$ toolforge envvars create PWB_ACCESS_TOKEN
tools.mytool@tools-sgebastion-10:~$ toolforge envvars create PWB_ACCESS_SECRET
</syntaxhighlight>
# You are now ready to run Pywikibot scripts on Toolforge.

== Running scripts ==
Use the <code>toolforge jobs</code> utility to run jobs using the [[Help:Toolforge/Jobs framework|jobs framework]].
The <code>tool-pywikibot/pywikibot-scripts-stable:latest</code> image is maintained by the Toolforge admin team and always points to the latest stable Pywikibot release.

{{Note|This image and its method of running scripts ''do not'' read or write files in the tool's $HOME directory. This includes files like [[mw:Manual:Pywikibot/user-config.py|user-config.py]] and [[mw:Manual:Pywikibot/user-fixes.py|user-fixes.py]] that you may be used to using with other Pywikibot deployments.}}

=== Examples ===
To run a single job:
<syntaxhighlight lang="shell-session">
tools.mytool@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:PROJECT -lang:LANGUAGE SCRIPT_NAME arg1 -always" JOB_NAME
</syntaxhighlight>
For example, to run the [[mw:Manual:Pywikibot/redirect.py|redirect.py]] script on Wikitech, you would use:
<syntaxhighlight lang="shell-session">
tools.mytool@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:wikitech -lang:en redirect double -always" fix-double-redirects
</syntaxhighlight>
You can use the <code>--schedule</code> option to run a job [[Help:Toolforge/Jobs_framework#Creating_scheduled_jobs_(cron_jobs)|on a timer]], for example every day:
<syntaxhighlight lang="shell-session">
tools.mytool@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:PROJECT -lang:LANGUAGE SCRIPT_NAME arg1 -always" --schedule "@daily" JOB_NAME
</syntaxhighlight>

== See also ==
* [[Help:Toolforge/Running Pywikibot scripts (advanced)]]
* [[Portal:Toolforge/Admin/Pywikibot image|Toolforge admin documentation for maintaining the image]]

{{:Help:Cloud Services communication}}
{{DEFAULTSORT:Pywikibot scripts}}
[[Category:Toolforge]]
[[Category:How-to-guide]]
[[Category:Python]]
fbopllw0nbaaqwt5jeddn3hnxakpsay 2309660 2309659 2025-06-09T08:25:35Z Taavi-WMF 41365 /* Setup */ 2309660 wikitext
text/x-wiki {{Toolforge nav}}
The '''[[:mw:Special:MyLanguage/Manual:Pywikibot|Pywikibot]]''' framework is a Python library and collection of scripts to automate work on MediaWiki sites. This tutorial contains instructions on how to run '''[[:mw:Special:MyLanguage/Manual:Pywikibot/Scripts|Pywikibot's built-in scripts]]''' on Toolforge using the [[Help:Toolforge/Jobs framework|Toolforge jobs framework]].

{{Note|If you want to run a script that is not included with Pywikibot itself, follow the [[Help:Toolforge/Running Pywikibot scripts (advanced)|advanced Pywikibot tutorial]].}}

== Prerequisites ==
To effectively use Pywikibot on Toolforge, you need:
* Access to a tool account on Toolforge. See [[Help:Toolforge/Quickstart]] to learn how to set one up and use it. Note that to use Toolforge you need a beginner-level understanding of the Linux terminal, SSH (see [[wikibooks:Internet Technologies/SSH|Internet Technologies/SSH]]), and Bash (see [[wikibooks:Bash Shell Scripting|Bash Shell Scripting]]).
* Basic familiarity with Python and Pywikibot. You don't need to know how to program in Python, but it helps to understand how to run Python scripts. To learn Python, see [[wikibooks:Non-Programmer's Tutorial for Python 3|Non-Programmer's Tutorial for Python 3]] or [https://docs.python.org/3/tutorial/ The Python Tutorial in Python documentation]. For information about Pywikibot, see [[mw:Special:MyLanguage/Manual:Pywikibot|Manual:Pywikibot]].

== Setup ==
The provided Pywikibot image uses [[mw:Manual:Pywikibot/OAuth/Wikimedia|OAuth]] for authenticating with Wikimedia wikis. To set it up:
# Set up an owner-only OAuth 1.0a credential following the instructions at [[mw:Manual:Pywikibot/OAuth/Wikimedia#Registering your bot with the wiki software]]. The required dependencies are already installed and the configuration is included in the image.
# On Toolforge, create [[Help:Toolforge/Envvars Service|environment variables]] that contain the bot username and the four values you got when creating the OAuth credential:<syntaxhighlight lang="shell-session">
tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_USERNAME "Example"
tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_CONSUMER_TOKEN
Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort):
tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_CONSUMER_SECRET
Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort):
tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_ACCESS_TOKEN
Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort):
tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_ACCESS_SECRET
Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort):
</syntaxhighlight>
# You are now ready to run Pywikibot scripts on Toolforge.

== Running scripts ==
Use the <code>toolforge jobs</code> utility to run jobs using the [[Help:Toolforge/Jobs framework|jobs framework]]. The <code>tool-pywikibot/pywikibot-scripts-stable:latest</code> image is maintained by the Toolforge admin team and always points to the latest stable Pywikibot release.

{{Note|This image and its method of running scripts ''do not'' read or write files in the tool's $HOME directory.
This includes files like [[mw:Manual:Pywikibot/user-config.py|user-config.py]] and [[mw:Manual:Pywikibot/user-fixes.py|user-fixes.py]] that you may be used to using with other Pywikibot deployments.}}

=== Examples ===
To run a single job:
<syntaxhighlight lang="shell-session">
tools.mytool@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:PROJECT -lang:LANGUAGE SCRIPT_NAME arg1 -always" JOB_NAME
</syntaxhighlight>
For example, to run the [[mw:Manual:Pywikibot/redirect.py|redirect.py]] script on Wikitech, you would use:
<syntaxhighlight lang="shell-session">
tools.mytool@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:wikitech -lang:en redirect double -always" fix-double-redirects
</syntaxhighlight>
You can use the <code>--schedule</code> option to run a job [[Help:Toolforge/Jobs_framework#Creating_scheduled_jobs_(cron_jobs)|on a timer]], for example every day:
<syntaxhighlight lang="shell-session">
tools.mytool@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:PROJECT -lang:LANGUAGE SCRIPT_NAME arg1 -always" --schedule "@daily" JOB_NAME
</syntaxhighlight>

== See also ==
* [[Help:Toolforge/Running Pywikibot scripts (advanced)]]
* [[Portal:Toolforge/Admin/Pywikibot image|Toolforge admin documentation for maintaining the image]]

{{:Help:Cloud Services communication}}
{{DEFAULTSORT:Pywikibot scripts}}
[[Category:Toolforge]]
[[Category:How-to-guide]]
[[Category:Python]]
27t9bjuulu18srsa5mm1doiu48kfout 2309661 2309660 2025-06-09T08:25:45Z Taavi-WMF 41365 /* Setup */ 2309661 wikitext text/x-wiki {{Toolforge nav}}
The '''[[:mw:Special:MyLanguage/Manual:Pywikibot|Pywikibot]]''' framework is a Python library and collection of scripts to automate work on MediaWiki sites.
This tutorial contains instructions on how to run '''[[:mw:Special:MyLanguage/Manual:Pywikibot/Scripts|Pywikibot's built-in scripts]]''' on Toolforge using the [[Help:Toolforge/Jobs framework|Toolforge jobs framework]].

{{Note|If you want to run a script that is not included with Pywikibot itself, follow the [[Help:Toolforge/Running Pywikibot scripts (advanced)|advanced Pywikibot tutorial]].}}

== Prerequisites ==
To effectively use Pywikibot on Toolforge, you need:
* Access to a tool account on Toolforge. See [[Help:Toolforge/Quickstart]] to learn how to set one up and use it. Note that to use Toolforge you need a beginner-level understanding of the Linux terminal, SSH (see [[wikibooks:Internet Technologies/SSH|Internet Technologies/SSH]]), and Bash (see [[wikibooks:Bash Shell Scripting|Bash Shell Scripting]]).
* Basic familiarity with Python and Pywikibot. You don't need to know how to program in Python, but it helps to understand how to run Python scripts. To learn Python, see [[wikibooks:Non-Programmer's Tutorial for Python 3|Non-Programmer's Tutorial for Python 3]] or [https://docs.python.org/3/tutorial/ The Python Tutorial in Python documentation]. For information about Pywikibot, see [[mw:Special:MyLanguage/Manual:Pywikibot|Manual:Pywikibot]].

== Setup ==
The provided Pywikibot image uses [[mw:Manual:Pywikibot/OAuth/Wikimedia|OAuth]] for authenticating with Wikimedia wikis. To set it up:
# Set up an owner-only OAuth 1.0a credential following the instructions at [[mw:Manual:Pywikibot/OAuth/Wikimedia#Registering your bot with the wiki software]]. The required dependencies are already installed and the configuration is included in the image.
# On Toolforge, create [[Help:Toolforge/Envvars Service|environment variables]] that contain the bot username and the four values you got when creating the OAuth credential:<syntaxhighlight lang="shell-session"> tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_USERNAME "Example" tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_CONSUMER_TOKEN Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort): tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_CONSUMER_SECRET Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort): tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_ACCESS_TOKEN Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort): tools.mytool@tools-bastion-13:~$ toolforge envvars create PWB_ACCESS_SECRET Enter the value of your envvar (prompt is hidden, hit Ctrl+C to abort): </syntaxhighlight> # You are now ready to run Pywikibot scripts on Toolforge. == Running scripts == Use the <code>toolforge jobs</code> utility to run jobs using the [[Help:Toolforge/Jobs framework|jobs framework]]. The <code>tool-pywikibot/pywikibot-scripts-stable:latest</code> image is maintained by the Toolforge admin team and always points to the latest stable Pywikibot release. {{Note|This image and its method of running scripts ''do not'' read or write files in the tool's $HOME directory. 
This includes files like [[mw:Manual:Pywikibot/user-config.py|user-config.py]] and [[mw:Manual:Pywikibot/user-fixes.py|user-fixes.py]] that you may be used to using with other Pywikibot deployments.}} === Examples === To run a single job: <syntaxhighlight lang="shell-session"> tools.mytool@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:PROJECT -lang:LANGUAGE SCRIPT_NAME arg1 -always" JOB_NAME </syntaxhighlight> For example, to run the [[mw:Manual:Pywikibot/redirect.py|redirect.py]] script on Wikitech, you would use: <syntaxhighlight lang="shell-session"> tools.mytool@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:wikitech -lang:en redirect double -always" fix-double-redirects </syntaxhighlight> You can use the <code>--schedule</code> option to run a job [[Help:Toolforge/Jobs_framework#Creating_scheduled_jobs_(cron_jobs)|on a timer]], for example every day: <syntaxhighlight lang="shell-session"> tools.mytool@tools-sgebastion-10:~$ toolforge jobs run --image tool-pywikibot/pywikibot-scripts-stable:latest --command "pwb -family:PROJECT -lang:LANGUAGE SCRIPT_NAME arg1 -always" --schedule "@daily" JOB_NAME </syntaxhighlight> == See also == * [[Help:Toolforge/Running Pywikibot scripts (advanced)]] * [[Portal:Toolforge/Admin/Pywikibot image|Toolforge admin documentation for maintaining the image]] {{:Help:Cloud Services communication}} {{DEFAULTSORT:Pywikibot scripts}} [[Category:Toolforge]] [[Category:How-to-guide]] [[Category:Python]] 567lud3v1e2zxnbvrky6g6dcov12ch9 Tool:Gitlab-account-approval/Log 116 453906 2309634 2309421 2025-06-08T21:06:11Z Gitlabaccountapprovalbot 37332 mm-episodenlistedlvaupdater was rejected. 2309634 wikitext text/x-wiki <noinclude>'''Audit log of approvals''' made by [[gitlab:gitlabaccountapprovalbot|@gitlabaccountapprovalbot]]. 
__NOTOC__</noinclude> === 2025-06-08 === * 21:06 "mm-episodenlistedlvaupdater" was rejected (pending since 2025-03-09T21:04:06.323Z). === 2025-06-06 === * 11:06 [[gitlab:olea|@olea]] was approved. === 2025-06-05 === * 20:33 [[gitlab:encodedwp|@encodedwp]] was approved. * 15:00 [[gitlab:toluayo|@toluayo]] was approved. * 13:51 [[gitlab:arnold_lup|@arnold_lup]] was approved. * 11:54 "sdhehua" was rejected (pending since 2025-03-06T11:51:48.241Z). === 2025-06-03 === * 21:27 [[gitlab:wewakey|@wewakey]] was approved. * 12:36 "hunsimon2" was rejected (pending since 2025-03-04T12:34:56.520Z). * 11:54 "hunsimon" was rejected (pending since 2025-03-04T11:53:54.652Z). === 2025-06-02 === * 12:01 [[gitlab:jaimedes|@jaimedes]] was approved. === 2025-05-30 === * 18:00 "sathvik9105" was rejected (pending since 2025-02-28T17:59:42.867Z). * 11:21 [[gitlab:tonythomas01|@tonythomas01]] was approved. * 10:06 [[gitlab:gpsleo|@gpsleo]] was approved. === 2025-05-29 === * 22:12 [[gitlab:codynguyen1116|@codynguyen1116]] was approved. === 2025-05-28 === * 02:57 [[gitlab:saper|@saper]] was approved. === 2025-05-27 === * 21:06 [[gitlab:mohammed_qays|@mohammed_qays]] was approved. * 15:33 "satanluimm" was rejected (pending since 2025-02-25T15:32:48.101Z). === 2025-05-26 === * 23:57 "seyedali220" was rejected (pending since 2025-02-24T23:56:17.621Z). === 2025-05-21 === * 11:12 [[gitlab:guilherme|@guilherme]] was approved. === 2025-05-19 === * 13:24 [[gitlab:emojiwiki|@emojiwiki]] was approved. === 2025-05-18 === * 00:00 "xidme" was rejected (pending since 2025-02-15T23:58:56.796Z). === 2025-05-17 === * 02:39 "kdh8219" was rejected (pending since 2025-02-15T02:36:32.237Z). === 2025-05-16 === * 15:09 [[gitlab:maxbinderwmf|@maxbinderwmf]] was approved. === 2025-05-15 === * 04:30 "inspectorzer0" was rejected (pending since 2025-02-13T04:27:33.179Z). === 2025-05-14 === * 17:42 [[gitlab:llugo|@llugo]] was approved. 
=== 2025-05-13 === * 20:18 "mmta" was rejected (pending since 2025-02-11T20:17:23.407Z). === 2025-05-11 === * 20:51 "jad" was rejected (pending since 2025-02-09T20:49:07.333Z). * 17:54 "nishchalsundan" was rejected (pending since 2025-02-09T17:52:25.761Z). * 16:39 "mohammed_abukhadra" was rejected (pending since 2025-02-09T16:39:03.730Z). === 2025-05-09 === * 09:12 [[gitlab:sirchanmp|@sirchanmp]] was approved. === 2025-05-08 === * 08:18 [[gitlab:mengeditch|@mengeditch]] was approved. === 2025-05-07 === * 03:45 "xluffy" was rejected (pending since 2025-02-05T03:45:14.181Z). === 2025-05-06 === * 16:54 "punhaniabhishek" was rejected (pending since 2025-02-04T16:53:50.758Z). * 09:36 [[gitlab:bmartinezcalvo|@bmartinezcalvo]] was approved. === 2025-05-02 === * 12:24 [[gitlab:tohaomg|@tohaomg]] was approved. * 11:48 [[gitlab:mavrikant|@mavrikant]] was approved. * 11:45 [[gitlab:daanvr|@daanvr]] was approved. === 2025-05-01 === * 09:09 "mjoerg" was rejected (pending since 2025-01-30T09:09:04.204Z). === 2025-04-30 === * 23:06 "sanskardubey" was rejected (pending since 2025-01-29T23:03:25.489Z). === 2025-04-29 === * 16:00 "geyslein" was rejected (pending since 2025-01-28T16:00:01.510Z). === 2025-04-26 === * 09:30 "anjali9027" was rejected (pending since 2025-01-25T09:28:07.064Z). === 2025-04-25 === * 18:00 "salahhazaa" was rejected (pending since 2025-01-24T17:58:30.030Z). * 15:15 [[gitlab:yiming|@yiming]] was approved. * 02:06 "mrchanmp" was rejected (pending since 2025-01-24T02:03:58.308Z). === 2025-04-23 === * 17:03 "rj2904" was rejected (pending since 2025-01-22T17:03:11.207Z). * 14:21 "nischay33" was rejected (pending since 2025-01-22T14:19:21.081Z). === 2025-04-22 === * 19:27 "dj80" was rejected (pending since 2025-01-21T19:25:28.498Z). * 14:30 [[gitlab:kaimamin|@kaimamin]] was approved. * 09:57 "debo" was rejected (pending since 2025-01-21T09:54:47.955Z). === 2025-04-21 === * 12:24 "unshell" was rejected (pending since 2025-01-20T12:21:59.686Z). 
=== 2025-04-18 === * 15:06 [[gitlab:spartanarbinger|@spartanarbinger]] was approved. === 2025-04-16 === * 03:09 "dewey" was rejected (pending since 2025-01-15T03:06:17.488Z). === 2025-04-15 === * 19:45 "emdadul" was rejected (pending since 2025-01-14T19:42:29.285Z). === 2025-04-14 === * 06:45 [[gitlab:bcampbell804|@bcampbell804]] was approved. === 2025-04-11 === * 06:27 [[gitlab:jvanderhoop|@jvanderhoop]] was approved. === 2025-04-10 === * 04:12 "bhai420" was rejected (pending since 2025-01-09T04:10:29.430Z). === 2025-04-09 === * 05:03 "austinvarshney" was rejected (pending since 2025-01-08T05:02:34.175Z). === 2025-04-06 === * 15:36 [[gitlab:elph|@elph]] was approved. === 2025-04-02 === * 10:33 [[gitlab:ozge|@ozge]] was approved. === 2025-03-31 === * 20:15 "demandkey" was rejected (pending since 2024-12-30T20:14:23.096Z). * 15:18 [[gitlab:danyya|@danyya]] was approved. === 2025-03-28 === * 15:54 [[gitlab:rutsavi09|@rutsavi09]] was approved. * 15:54 [[gitlab:ilanen1|@ilanen1]] was approved. === 2025-03-25 === * 19:27 [[gitlab:irfo|@irfo]] was approved. * 11:54 [[gitlab:kmontalva-wmf|@kmontalva-wmf]] was approved. * 04:33 [[gitlab:paul26|@paul26]] was approved. * 04:18 "as1100k" was rejected (pending since 2024-12-24T04:18:06.813Z). === 2025-03-24 === * 11:33 "amzadkhankk" was rejected (pending since 2024-12-23T11:33:14.176Z). === 2025-03-23 === * 12:24 "wolfdo" was rejected (pending since 2024-12-22T12:23:35.056Z). === 2025-03-22 === * 09:45 [[gitlab:fjmustak|@fjmustak]] was approved. === 2025-03-20 === * 18:42 "sathishkokila" was rejected (pending since 2024-12-19T18:39:35.161Z). * 17:03 [[gitlab:alien4444|@alien4444]] was approved. * 15:27 [[gitlab:davidcoronel|@davidcoronel]] was approved. === 2025-03-19 === * 22:57 [[gitlab:r1f4t|@r1f4t]] was approved. * 19:03 "daniel24ps" was rejected (pending since 2024-12-18T19:00:21.249Z). * 14:18 [[gitlab:beepbooppenguin|@beepbooppenguin]] was approved. 
=== 2025-03-18 === * 17:48 "rahulkundu1209" was rejected (pending since 2024-12-17T17:46:41.936Z). * 08:15 "kirtisikka972" was rejected (pending since 2024-12-17T08:13:25.487Z). === 2025-03-15 === * 13:30 "tulspal_sidhu" was rejected (pending since 2024-12-14T13:29:10.606Z). * 01:39 "peacedeadc" was rejected (pending since 2024-12-14T01:37:36.579Z). === 2025-03-14 === * 03:51 [[gitlab:chuckthebuck|@chuckthebuck]] was approved. * 02:33 "yxngtrtxll" was rejected (pending since 2024-12-13T02:31:51.658Z). === 2025-03-13 === * 14:36 [[gitlab:iccander|@iccander]] was approved. === 2025-03-12 === * 23:21 "jokerchic36" was rejected (pending since 2024-12-11T23:21:00.670Z). * 15:30 [[gitlab:naomi|@naomi]] was approved. * 15:27 [[gitlab:cobi|@cobi]] was approved. === 2025-03-11 === * 12:42 "mohitvermaxx" was rejected (pending since 2024-12-10T12:40:56.967Z). === 2025-03-10 === * 16:51 [[gitlab:nanona15dobato|@nanona15dobato]] was approved. === 2025-03-09 === * 22:39 [[gitlab:jonkolbert|@jonkolbert]] was approved. * 20:45 [[gitlab:urbanecmtest2|@urbanecmtest2]] was approved. === 2025-03-07 === * 16:54 [[gitlab:hswan|@hswan]] was approved. * 14:42 [[gitlab:atitkov|@atitkov]] was approved. * 00:42 [[gitlab:infrastruktur|@infrastruktur]] was approved. === 2025-03-06 === * 17:21 "johnmann" was rejected (pending since 2024-12-05T17:19:24.995Z). === 2025-03-05 === * 07:33 [[gitlab:monx9494|@monx9494]] was approved. === 2025-03-02 === * 21:21 "paul26" was rejected (pending since 2024-12-01T21:20:19.681Z). === 2025-03-01 === * 19:15 [[gitlab:izno|@izno]] was approved. * 12:45 [[gitlab:nyerho|@nyerho]] was approved. === 2025-02-28 === * 18:27 [[gitlab:chuckonwumelu|@chuckonwumelu]] was approved. * 13:09 "ashwinpraveengo" was rejected (pending since 2024-11-29T13:07:47.240Z). * 00:18 "eduardoaugusto" was rejected (pending since 2024-11-29T00:17:43.372Z). === 2025-02-27 === * 20:39 "volkanurl" was rejected (pending since 2024-11-28T20:37:18.101Z). 
=== 2025-02-24 === * 21:15 [[gitlab:feeglgeef|@feeglgeef]] was approved. * 20:18 [[gitlab:piaanalysis2|@piaanalysis2]] was approved. * 19:06 [[gitlab:dhardy|@dhardy]] was approved. === 2025-02-22 === * 19:27 [[gitlab:owuh|@owuh]] was approved. === 2025-02-19 === * 16:06 [[gitlab:artemkloko|@artemkloko]] was approved. * 13:03 [[gitlab:jgafnea|@jgafnea]] was approved. === 2025-02-17 === * 16:33 [[gitlab:asmartkitten|@asmartkitten]] was approved. === 2025-02-16 === * 19:12 "gaurigupta21" was rejected (pending since 2024-11-17T19:11:07.416Z). === 2025-02-15 === * 01:18 [[gitlab:mediawiki-quickstart-ci|@mediawiki-quickstart-ci]] was approved. === 2025-02-14 === * 15:21 "nathanbnm" was rejected (pending since 2024-11-15T15:18:19.632Z). === 2025-02-13 === * 16:45 [[gitlab:priyanshuchahal|@priyanshuchahal]] was approved. * 16:42 [[gitlab:ajhalili2006|@ajhalili2006]] was approved. === 2025-02-12 === * 23:21 "monkeypatch999" was rejected (pending since 2024-11-13T23:20:38.398Z). * 06:36 [[gitlab:jainlakshita28|@jainlakshita28]] was approved. === 2025-02-11 === * 19:27 [[gitlab:matthewsm2|@matthewsm2]] was approved. === 2025-02-09 === * 16:15 "mohammed_abukhadra" was rejected (pending since 2024-11-10T16:15:18.361Z). === 2025-02-07 === * 21:33 "brennan" was rejected (pending since 2024-11-08T21:31:07.351Z). === 2025-02-06 === * 08:24 "mmta" was rejected (pending since 2024-11-07T08:22:36.724Z). * 06:21 [[gitlab:bunnypranav|@bunnypranav]] was approved. === 2025-02-05 === * 22:39 "chrissteinchen" was rejected (pending since 2024-11-06T22:38:16.673Z). === 2025-02-03 === * 07:45 "edriiic" was rejected (pending since 2024-11-04T07:44:46.849Z). * 01:12 "geppy" was rejected (pending since 2024-11-04T01:10:48.710Z). === 2025-02-02 === * 13:18 "funa-enpitu" was rejected (pending since 2024-11-03T13:15:46.065Z). === 2025-01-31 === * 23:42 "nfontes" was rejected (pending since 2024-11-01T23:39:41.755Z). * 22:51 "sbronson" was rejected (pending since 2024-11-01T22:50:31.871Z). 
* 00:42 [[gitlab:farid|@farid]] was approved. === 2025-01-27 === * 08:15 [[gitlab:eliza189|@eliza189]] was approved. === 2025-01-25 === * 09:51 [[gitlab:pamputt|@pamputt]] was approved. === 2025-01-23 === * 14:30 [[gitlab:lubianat|@lubianat]] was approved. * 11:45 [[gitlab:bootsa|@bootsa]] was approved. === 2025-01-21 === * 05:09 "niko" was rejected (pending since 2024-07-21T16:10:01.377Z). * 05:09 "thawizkid369777" was rejected (pending since 2024-07-18T17:42:44.493Z). * 05:09 "sarthaksingh2" was rejected (pending since 2024-07-10T11:31:30.470Z). * 05:09 "shriyakt" was rejected (pending since 2024-07-06T04:54:10.248Z). * 05:09 "akshaya" was rejected (pending since 2024-07-06T04:04:51.488Z). * 05:09 "alaka03aj" was rejected (pending since 2024-07-05T18:01:54.876Z). * 05:09 "sulochanaviji-5049" was rejected (pending since 2024-07-01T05:58:00.427Z). * 05:09 "nayanjnath" was rejected (pending since 2024-07-01T02:51:57.405Z). * 05:09 "sd44" was rejected (pending since 2024-06-30T04:28:51.436Z). * 05:09 "metavalent" was rejected (pending since 2024-06-29T01:37:14.210Z). * 05:09 "wicloudx" was rejected (pending since 2024-06-28T11:51:23.335Z). * 05:09 "debo" was rejected (pending since 2024-06-28T01:44:59.845Z). * 05:09 "bwiki" was rejected (pending since 2024-06-23T14:15:38.032Z). * 05:09 "toprak" was rejected (pending since 2024-06-23T11:35:50.819Z). * 05:09 "iristeller" was rejected (pending since 2024-06-14T20:53:48.959Z). * 05:09 "jcolvin" was rejected (pending since 2024-06-12T17:29:01.238Z). * 05:09 "kalyan" was rejected (pending since 2024-06-07T07:52:46.993Z). * 05:09 "bluecrystal" was rejected (pending since 2024-06-06T19:16:20.107Z). * 05:09 "iftttrohit" was rejected (pending since 2024-06-04T12:08:50.818Z). * 05:09 "pogpotato" was rejected (pending since 2024-06-03T17:58:21.684Z). * 05:09 "cptlausebaer" was rejected (pending since 2024-05-31T18:53:27.692Z). * 05:09 "hdevine825" was rejected (pending since 2024-05-31T17:04:18.279Z). 
* 05:09 "anaghaa18" was rejected (pending since 2024-05-25T19:14:31.803Z). * 05:09 "atharvanair04" was rejected (pending since 2024-05-25T14:24:52.825Z). * 05:09 "anasvemmully" was rejected (pending since 2024-05-25T06:10:27.261Z). * 05:09 "abhinavmohandas" was rejected (pending since 2024-05-25T06:05:24.825Z). * 05:09 "kksurendran06" was rejected (pending since 2024-05-25T06:04:38.082Z). * 05:09 "albertmarshall8896" was rejected (pending since 2024-05-23T09:32:05.462Z). * 05:09 "akellison" was rejected (pending since 2024-05-17T02:07:24.229Z). * 05:09 "mainowill" was rejected (pending since 2024-04-16T23:30:33.881Z). * 05:09 "bzhqc" was rejected (pending since 2024-04-16T19:50:38.676Z). * 05:09 "safan41" was rejected (pending since 2024-04-16T03:34:48.942Z). * 05:09 "mgagat" was rejected (pending since 2024-04-16T03:21:51.764Z). * 05:09 "okeamah" was rejected (pending since 2024-04-16T02:49:00.143Z). * 05:09 "xuhao61" was rejected (pending since 2024-04-15T23:45:09.083Z). * 04:47 "cybel" was rejected (pending since 2024-04-15T06:46:35.791Z). === 2025-01-20 === * 14:33 [[gitlab:your1|@your1]] was approved. === 2025-01-18 === * 10:09 [[gitlab:galrach600|@galrach600]] was approved. * 02:51 [[gitlab:blankeclair|@blankeclair]] was approved. === 2025-01-17 === * 13:57 [[gitlab:dsantamaria|@dsantamaria]] was approved. === 2025-01-15 === * 17:12 [[gitlab:smartse|@smartse]] was approved. === 2025-01-14 === * 17:03 [[gitlab:naorleizer|@naorleizer]] was approved. === 2025-01-13 === * 02:45 [[gitlab:wolf20482|@wolf20482]] was approved. === 2025-01-12 === * 17:45 [[gitlab:tamzin|@tamzin]] was approved. === 2025-01-11 === * 15:24 [[gitlab:bargioni|@bargioni]] was approved. * 14:30 [[gitlab:salelya|@salelya]] was approved. * 10:15 [[gitlab:malakatshy|@malakatshy]] was approved. * 05:21 [[gitlab:newmcpee|@newmcpee]] was approved. === 2025-01-09 === * 15:30 [[gitlab:gkyziridis|@gkyziridis]] was approved. === 2025-01-08 === * 16:21 [[gitlab:ukrface|@ukrface]] was approved. 
=== 2024-12-28 === * 03:27 [[gitlab:twonum|@twonum]] was approved. === 2024-12-25 === * 06:09 [[gitlab:harsv567|@harsv567]] was approved. === 2024-12-21 === * 11:24 [[gitlab:amutha2002|@amutha2002]] was approved. === 2024-12-20 === * 19:51 [[gitlab:hridyeshgupta|@hridyeshgupta]] was approved. * 10:00 [[gitlab:ro-shines|@ro-shines]] was approved. * 08:09 [[gitlab:kesharwaniarpita|@kesharwaniarpita]] was approved. === 2024-12-18 === * 14:45 [[gitlab:soylacarli|@soylacarli]] was approved. === 2024-12-16 === * 20:33 [[gitlab:aleyasiddika1|@aleyasiddika1]] was approved. === 2024-12-15 === * 07:33 [[gitlab:abhishek02bhardwaj|@abhishek02bhardwaj]] was approved. === 2024-12-13 === * 13:18 [[gitlab:ashmitabathre204|@ashmitabathre204]] was approved. === 2024-12-10 === * 06:39 [[gitlab:ginaan|@ginaan]] was approved. === 2024-12-09 === * 05:45 [[gitlab:kallinavya|@kallinavya]] was approved. * 00:54 [[gitlab:viserion-7|@viserion-7]] was approved. === 2024-12-08 === * 17:27 [[gitlab:wargo|@wargo]] was approved. === 2024-12-05 === * 11:15 [[gitlab:ranjithraj|@ranjithraj]] was approved. === 2024-12-02 === * 21:21 [[gitlab:a930913|@a930913]] was approved. === 2024-12-01 === * 02:39 [[gitlab:kingchristlike1|@kingchristlike1]] was approved. === 2024-11-21 === * 13:45 [[gitlab:sascha|@sascha]] was approved. === 2024-11-19 === * 16:36 [[gitlab:jly|@jly]] was approved. === 2024-11-15 === * 02:54 [[gitlab:danielyepezgarces|@danielyepezgarces]] was approved. === 2024-11-14 === * 14:15 [[gitlab:stimoroll|@stimoroll]] was approved. === 2024-11-09 === * 17:15 [[gitlab:f4udeveloper|@f4udeveloper]] was approved. === 2024-11-07 === * 19:15 [[gitlab:zulf|@zulf]] was approved. * 05:33 [[gitlab:hassanamin|@hassanamin]] was approved. === 2024-11-06 === * 19:39 [[gitlab:daniuu|@daniuu]] was approved. * 00:18 [[gitlab:rlopez-wmf|@rlopez-wmf]] was approved. === 2024-10-09 === * 14:45 [[gitlab:jtweed|@jtweed]] was approved. * 10:24 [[gitlab:ifrahkh|@ifrahkh]] was approved. 
* 09:06 [[gitlab:wikibayer|@wikibayer]] was approved. === 2024-10-06 === * 10:27 [[gitlab:keerthan16|@keerthan16]] was approved. === 2024-10-04 === * 07:45 [[gitlab:hakimi97|@hakimi97]] was approved. === 2024-09-30 === * 07:39 [[gitlab:ninjastrikers|@ninjastrikers]] was approved. === 2024-09-28 === * 17:30 [[gitlab:webrunner95|@webrunner95]] was approved. === 2024-09-18 === * 21:39 [[gitlab:elliottetzkorn|@elliottetzkorn]] was approved. === 2024-09-14 === * 22:06 [[gitlab:humptydumpty|@humptydumpty]] was approved. === 2024-09-06 === * 08:48 [[gitlab:mickabarber|@mickabarber]] was approved. === 2024-08-27 === * 17:36 [[gitlab:edgars|@edgars]] was approved. === 2024-08-22 === * 09:18 [[gitlab:antonkokhwmde|@antonkokhwmde]] was approved. === 2024-08-14 === * 19:21 [[gitlab:jfk|@jfk]] was approved. === 2024-08-13 === * 17:57 [[gitlab:daxserver|@daxserver]] was approved. === 2024-08-11 === * 09:57 [[gitlab:pauliesnug|@pauliesnug]] was approved. === 2024-08-10 === * 08:42 [[gitlab:ashig|@ashig]] was approved. === 2024-08-09 === * 14:09 [[gitlab:masssly|@masssly]] was approved. === 2024-08-05 === * 22:15 [[gitlab:mrtortue|@mrtortue]] was approved. === 2024-08-02 === * 16:21 [[gitlab:dsantini|@dsantini]] was approved. === 2024-07-31 === * 11:54 [[gitlab:cptviraj|@cptviraj]] was approved. === 2024-07-30 === * 19:09 [[gitlab:iniquity|@iniquity]] was approved. * 10:00 [[gitlab:collins|@collins]] was approved. === 2024-07-27 === * 15:57 [[gitlab:songnguxyz|@songnguxyz]] was approved. === 2024-07-25 === * 12:36 [[gitlab:mszabo|@mszabo]] was approved. * 09:21 [[gitlab:agarwalmahima|@agarwalmahima]] was approved. === 2024-07-24 === * 08:05 [[gitlab:dragoniez|@dragoniez]] was approved. === 2024-07-23 === * 06:54 [[gitlab:mirji|@mirji]] was approved. === 2024-07-16 === * 10:00 [[gitlab:lakejason0|@lakejason0]] was approved. === 2024-07-12 === * 11:33 [[gitlab:cn|@cn]] was approved. * 08:12 [[gitlab:unchampignon|@unchampignon]] was approved. 
=== 2024-07-07 === * 17:12 [[gitlab:agamyasamuel|@agamyasamuel]] was approved. * 05:24 [[gitlab:kuldeepburjbhalaike|@kuldeepburjbhalaike]] was approved. === 2024-07-06 === * 11:18 [[gitlab:dibya|@dibya]] was approved. * 04:54 [[gitlab:sarthakparashar|@sarthakparashar]] was approved. === 2024-07-05 === * 18:15 [[gitlab:vanshikarathi|@vanshikarathi]] was approved. === 2024-07-02 === * 19:00 [[gitlab:ebrahim|@ebrahim]] was approved. === 2024-07-01 === * 20:12 [[gitlab:rockingpenny4|@rockingpenny4]] was approved. * 18:15 [[gitlab:balajijagadesh|@balajijagadesh]] was approved. === 2024-06-30 === * 18:24 [[gitlab:hrideshmg|@hrideshmg]] was approved. * 07:18 [[gitlab:chanakyakumardas|@chanakyakumardas]] was approved. * 06:30 [[gitlab:rihaan180|@rihaan180]] was approved. === 2024-06-27 === * 17:36 [[gitlab:driedmueller|@driedmueller]] was approved. === 2024-06-19 === * 12:57 [[gitlab:audreypenven|@audreypenven]] was approved. === 2024-06-16 === * 01:18 [[gitlab:roysmith|@roysmith]] was approved. === 2024-06-08 === * 02:45 [[gitlab:jleedev|@jleedev]] was approved. === 2024-06-03 === * 13:57 [[gitlab:afeder|@afeder]] was approved. === 2024-06-01 === * 10:54 [[gitlab:florianschmitt|@florianschmitt]] was approved. === 2024-05-30 === * 16:42 [[gitlab:krlsca|@krlsca]] was approved. === 2024-05-28 === * 11:24 [[gitlab:rickijay|@rickijay]] was approved. === 2024-05-26 === * 11:18 [[gitlab:ranjithsiji|@ranjithsiji]] was approved. === 2024-05-25 === * 07:24 [[gitlab:jony|@jony]] was approved. === 2024-05-23 === * 08:45 [[gitlab:lepticed7|@lepticed7]] was approved. === 2024-05-22 === * 20:42 [[gitlab:echecs|@echecs]] was approved. === 2024-05-21 === * 13:33 [[gitlab:mbs|@mbs]] was approved. === 2024-05-19 === * 18:06 [[gitlab:ionenlaser|@ionenlaser]] was approved. === 2024-05-18 === * 23:36 [[gitlab:mdaniels5757|@mdaniels5757]] was approved. === 2024-05-17 === * 08:54 [[gitlab:grapedog|@grapedog]] was approved. === 2024-05-08 === * 19:42 [[gitlab:kelhurd|@kelhurd]] was approved. 
* 19:06 [[gitlab:khurd|@khurd]] was approved. === 2024-05-06 === * 19:48 [[gitlab:j3j5|@j3j5]] was approved. * 12:06 [[gitlab:tk-999|@tk-999]] was approved. === 2024-05-05 === * 22:09 [[gitlab:pppery|@pppery]] was approved. * 20:33 [[gitlab:sakretsu|@sakretsu]] was approved. * 12:12 [[gitlab:waterquark|@waterquark]] was approved. === 2024-05-04 === * 09:03 [[gitlab:multichill|@multichill]] was approved. * 07:42 [[gitlab:abaris|@abaris]] was approved. === 2024-05-03 === * 14:57 [[gitlab:maurusian|@maurusian]] was approved. === 2024-04-24 === * 05:48 [[gitlab:wolfinux|@wolfinux]] was approved. === 2024-04-23 === * 15:48 [[gitlab:dreamrimmer|@dreamrimmer]] was approved. === 2024-04-21 === * 06:51 [[gitlab:alon|@alon]] was approved. === 2024-04-17 === * 23:33 [[gitlab:derenrich|@derenrich]] was approved. === 2024-04-16 === * 17:18 [[gitlab:valcio|@valcio]] was approved. === 2024-04-14 === * 16:51 [[gitlab:wikilucas00|@wikilucas00]] was approved. === 2024-04-06 === * 12:48 [[gitlab:theprotonade|@theprotonade]] was approved. === 2024-04-02 === * 07:30 [[gitlab:bohuizhang|@bohuizhang]] was approved. === 2024-03-30 === * 13:36 [[gitlab:lpintscher|@lpintscher]] was approved. === 2024-03-26 === * 17:09 [[gitlab:eenabulele|@eenabulele]] was approved. === 2024-03-25 === * 14:27 [[gitlab:tuukka|@tuukka]] was approved. === 2024-03-24 === * 12:24 [[gitlab:firefly|@firefly]] was approved. === 2024-03-21 === * 19:33 [[gitlab:universal-omega|@universal-omega]] was approved. === 2024-03-17 === * 10:36 [[gitlab:bisel91|@bisel91]] was approved. === 2024-03-16 === * 10:09 [[gitlab:delord|@delord]] was approved. * 00:42 [[gitlab:athulvis1|@athulvis1]] was approved. === 2024-03-15 === * 19:06 [[gitlab:ignaciorodrguez|@ignaciorodrguez]] was approved. * 08:30 [[gitlab:peachey88|@peachey88]] was approved. * 06:51 [[gitlab:derick|@derick]] was approved. === 2024-03-12 === * 15:06 [[gitlab:xiaoxiao|@xiaoxiao]] was approved. 
=== 2024-03-06 === * 13:21 [[gitlab:desianabae1|@desianabae1]] was approved. === 2024-03-05 === * 19:21 [[gitlab:ep1c|@ep1c]] was approved. * 16:33 [[gitlab:jasmine|@jasmine]] was approved. === 2024-03-02 === * 06:42 [[gitlab:potsdamlamb|@potsdamlamb]] was approved. === 2024-02-29 === * 23:18 [[gitlab:arandomname123|@arandomname123]] was approved. * 18:03 [[gitlab:baba|@baba]] was approved. * 17:48 [[gitlab:yfdyh000|@yfdyh000]] was approved. * 03:09 [[gitlab:sds|@sds]] was approved. === 2024-02-27 === * 23:33 [[gitlab:lofhi|@lofhi]] was approved. === 2024-02-15 === * 19:45 [[gitlab:gergesshamon|@gergesshamon]] was approved. === 2024-02-14 === * 14:33 [[gitlab:philipnelson99|@philipnelson99]] was approved. === 2024-02-13 === * 13:06 [[gitlab:dringsim|@dringsim]] was approved. === 2024-02-12 === * 17:36 [[gitlab:haak|@haak]] was approved. === 2024-02-05 === * 17:33 [[gitlab:qwerfjkl|@qwerfjkl]] was approved. * 17:14 [[gitlab:ahecht|@ahecht]] was approved. === 2024-02-01 === * 09:27 [[gitlab:arinaigum|@arinaigum]] was approved. * 00:15 [[gitlab:jas42|@jas42]] was approved. * 00:15 [[gitlab:edhu|@edhu]] was approved. * 00:15 [[gitlab:marnanel|@marnanel]] was approved. * 00:15 [[gitlab:ibrahemqasim|@ibrahemqasim]] was approved. * 00:15 [[gitlab:amasotti|@amasotti]] was approved. * 00:15 [[gitlab:deni|@deni]] was approved. * 00:15 [[gitlab:cyber|@cyber]] was approved. * 00:15 [[gitlab:saroj|@saroj]] was approved. === 2024-01-29 === * 21:42 [[gitlab:rgupta|@rgupta]] was approved. === 2024-01-07 === * 09:48 [[gitlab:lutrome|@lutrome]] was approved. === 2024-01-05 === * 20:48 [[gitlab:jinoytommanjaly|@jinoytommanjaly]] was approved. * 02:51 [[gitlab:braunobruno|@braunobruno]] was approved. * 01:08 [[gitlab:amorymeltzer|@amorymeltzer]] was approved. * 01:08 [[gitlab:phi22ipus|@phi22ipus]] was approved. === 2024-01-03 === * 14:45 [[gitlab:gabina|@gabina]] was approved. === 2024-01-02 === * 13:18 [[gitlab:arthurtaylor|@arthurtaylor]] was approved. 
=== 2023-12-23 === * 00:33 [[gitlab:aram|@aram]] was approved. === 2023-12-22 === * 16:24 [[gitlab:elpitareio|@elpitareio]] was approved. === 2023-12-21 === * 00:43 [[gitlab:bsadowski1|@bsadowski1]] was approved. * 00:43 [[gitlab:ederporto|@ederporto]] was approved. * 00:43 [[gitlab:sadraiiali|@sadraiiali]] was approved. * 00:43 [[gitlab:wasp-outis|@wasp-outis]] was approved. * 00:43 [[gitlab:bodhisattwa|@bodhisattwa]] was approved. * 00:43 [[gitlab:air7538|@air7538]] was approved. * 00:43 [[gitlab:anzx|@anzx]] was approved. * 00:43 [[gitlab:tekask1903|@tekask1903]] was approved. * 00:42 [[gitlab:kiwi-0x010c|@kiwi-0x010c]] was approved. * 00:42 [[gitlab:mpaa|@mpaa]] was approved. * 00:42 [[gitlab:kutay|@kutay]] was approved. * 00:42 [[gitlab:wattmto|@wattmto]] was approved. jbyhlv49x0pi41szu5u8xbfikqjbt2b 2309654 2309634 2025-06-09T08:03:14Z Gitlabaccountapprovalbot 37332 a-ssh22 was rejected. 2309654 wikitext text/x-wiki <noinclude>'''Audit log of approvals''' made by [[gitlab:gitlabaccountapprovalbot|@gitlabaccountapprovalbot]]. __NOTOC__</noinclude> === 2025-06-09 === * 08:03 "a-ssh22" was rejected (pending since 2025-03-10T08:03:08.111Z). === 2025-06-08 === * 21:06 "mm-episodenlistedlvaupdater" was rejected (pending since 2025-03-09T21:04:06.323Z). === 2025-06-06 === * 11:06 [[gitlab:olea|@olea]] was approved. === 2025-06-05 === * 20:33 [[gitlab:encodedwp|@encodedwp]] was approved. * 15:00 [[gitlab:toluayo|@toluayo]] was approved. * 13:51 [[gitlab:arnold_lup|@arnold_lup]] was approved. * 11:54 "sdhehua" was rejected (pending since 2025-03-06T11:51:48.241Z). === 2025-06-03 === * 21:27 [[gitlab:wewakey|@wewakey]] was approved. * 12:36 "hunsimon2" was rejected (pending since 2025-03-04T12:34:56.520Z). * 11:54 "hunsimon" was rejected (pending since 2025-03-04T11:53:54.652Z). === 2025-06-02 === * 12:01 [[gitlab:jaimedes|@jaimedes]] was approved. === 2025-05-30 === * 18:00 "sathvik9105" was rejected (pending since 2025-02-28T17:59:42.867Z). 
* 11:21 [[gitlab:tonythomas01|@tonythomas01]] was approved. * 10:06 [[gitlab:gpsleo|@gpsleo]] was approved. === 2025-05-29 === * 22:12 [[gitlab:codynguyen1116|@codynguyen1116]] was approved. === 2025-05-28 === * 02:57 [[gitlab:saper|@saper]] was approved. === 2025-05-27 === * 21:06 [[gitlab:mohammed_qays|@mohammed_qays]] was approved. * 15:33 "satanluimm" was rejected (pending since 2025-02-25T15:32:48.101Z). === 2025-05-26 === * 23:57 "seyedali220" was rejected (pending since 2025-02-24T23:56:17.621Z). === 2025-05-21 === * 11:12 [[gitlab:guilherme|@guilherme]] was approved. === 2025-05-19 === * 13:24 [[gitlab:emojiwiki|@emojiwiki]] was approved. === 2025-05-18 === * 00:00 "xidme" was rejected (pending since 2025-02-15T23:58:56.796Z). === 2025-05-17 === * 02:39 "kdh8219" was rejected (pending since 2025-02-15T02:36:32.237Z). === 2025-05-16 === * 15:09 [[gitlab:maxbinderwmf|@maxbinderwmf]] was approved. === 2025-05-15 === * 04:30 "inspectorzer0" was rejected (pending since 2025-02-13T04:27:33.179Z). === 2025-05-14 === * 17:42 [[gitlab:llugo|@llugo]] was approved. === 2025-05-13 === * 20:18 "mmta" was rejected (pending since 2025-02-11T20:17:23.407Z). === 2025-05-11 === * 20:51 "jad" was rejected (pending since 2025-02-09T20:49:07.333Z). * 17:54 "nishchalsundan" was rejected (pending since 2025-02-09T17:52:25.761Z). * 16:39 "mohammed_abukhadra" was rejected (pending since 2025-02-09T16:39:03.730Z). === 2025-05-09 === * 09:12 [[gitlab:sirchanmp|@sirchanmp]] was approved. === 2025-05-08 === * 08:18 [[gitlab:mengeditch|@mengeditch]] was approved. === 2025-05-07 === * 03:45 "xluffy" was rejected (pending since 2025-02-05T03:45:14.181Z). === 2025-05-06 === * 16:54 "punhaniabhishek" was rejected (pending since 2025-02-04T16:53:50.758Z). * 09:36 [[gitlab:bmartinezcalvo|@bmartinezcalvo]] was approved. === 2025-05-02 === * 12:24 [[gitlab:tohaomg|@tohaomg]] was approved. * 11:48 [[gitlab:mavrikant|@mavrikant]] was approved. * 11:45 [[gitlab:daanvr|@daanvr]] was approved. 
=== 2025-05-01 === * 09:09 "mjoerg" was rejected (pending since 2025-01-30T09:09:04.204Z). === 2025-04-30 === * 23:06 "sanskardubey" was rejected (pending since 2025-01-29T23:03:25.489Z). === 2025-04-29 === * 16:00 "geyslein" was rejected (pending since 2025-01-28T16:00:01.510Z). === 2025-04-26 === * 09:30 "anjali9027" was rejected (pending since 2025-01-25T09:28:07.064Z). === 2025-04-25 === * 18:00 "salahhazaa" was rejected (pending since 2025-01-24T17:58:30.030Z). * 15:15 [[gitlab:yiming|@yiming]] was approved. * 02:06 "mrchanmp" was rejected (pending since 2025-01-24T02:03:58.308Z). === 2025-04-23 === * 17:03 "rj2904" was rejected (pending since 2025-01-22T17:03:11.207Z). * 14:21 "nischay33" was rejected (pending since 2025-01-22T14:19:21.081Z). === 2025-04-22 === * 19:27 "dj80" was rejected (pending since 2025-01-21T19:25:28.498Z). * 14:30 [[gitlab:kaimamin|@kaimamin]] was approved. * 09:57 "debo" was rejected (pending since 2025-01-21T09:54:47.955Z). === 2025-04-21 === * 12:24 "unshell" was rejected (pending since 2025-01-20T12:21:59.686Z). === 2025-04-18 === * 15:06 [[gitlab:spartanarbinger|@spartanarbinger]] was approved. === 2025-04-16 === * 03:09 "dewey" was rejected (pending since 2025-01-15T03:06:17.488Z). === 2025-04-15 === * 19:45 "emdadul" was rejected (pending since 2025-01-14T19:42:29.285Z). === 2025-04-14 === * 06:45 [[gitlab:bcampbell804|@bcampbell804]] was approved. === 2025-04-11 === * 06:27 [[gitlab:jvanderhoop|@jvanderhoop]] was approved. === 2025-04-10 === * 04:12 "bhai420" was rejected (pending since 2025-01-09T04:10:29.430Z). === 2025-04-09 === * 05:03 "austinvarshney" was rejected (pending since 2025-01-08T05:02:34.175Z). === 2025-04-06 === * 15:36 [[gitlab:elph|@elph]] was approved. === 2025-04-02 === * 10:33 [[gitlab:ozge|@ozge]] was approved. === 2025-03-31 === * 20:15 "demandkey" was rejected (pending since 2024-12-30T20:14:23.096Z). * 15:18 [[gitlab:danyya|@danyya]] was approved. 
=== 2025-03-28 === * 15:54 [[gitlab:rutsavi09|@rutsavi09]] was approved. * 15:54 [[gitlab:ilanen1|@ilanen1]] was approved. === 2025-03-25 === * 19:27 [[gitlab:irfo|@irfo]] was approved. * 11:54 [[gitlab:kmontalva-wmf|@kmontalva-wmf]] was approved. * 04:33 [[gitlab:paul26|@paul26]] was approved. * 04:18 "as1100k" was rejected (pending since 2024-12-24T04:18:06.813Z). === 2025-03-24 === * 11:33 "amzadkhankk" was rejected (pending since 2024-12-23T11:33:14.176Z). === 2025-03-23 === * 12:24 "wolfdo" was rejected (pending since 2024-12-22T12:23:35.056Z). === 2025-03-22 === * 09:45 [[gitlab:fjmustak|@fjmustak]] was approved. === 2025-03-20 === * 18:42 "sathishkokila" was rejected (pending since 2024-12-19T18:39:35.161Z). * 17:03 [[gitlab:alien4444|@alien4444]] was approved. * 15:27 [[gitlab:davidcoronel|@davidcoronel]] was approved. === 2025-03-19 === * 22:57 [[gitlab:r1f4t|@r1f4t]] was approved. * 19:03 "daniel24ps" was rejected (pending since 2024-12-18T19:00:21.249Z). * 14:18 [[gitlab:beepbooppenguin|@beepbooppenguin]] was approved. === 2025-03-18 === * 17:48 "rahulkundu1209" was rejected (pending since 2024-12-17T17:46:41.936Z). * 08:15 "kirtisikka972" was rejected (pending since 2024-12-17T08:13:25.487Z). === 2025-03-15 === * 13:30 "tulspal_sidhu" was rejected (pending since 2024-12-14T13:29:10.606Z). * 01:39 "peacedeadc" was rejected (pending since 2024-12-14T01:37:36.579Z). === 2025-03-14 === * 03:51 [[gitlab:chuckthebuck|@chuckthebuck]] was approved. * 02:33 "yxngtrtxll" was rejected (pending since 2024-12-13T02:31:51.658Z). === 2025-03-13 === * 14:36 [[gitlab:iccander|@iccander]] was approved. === 2025-03-12 === * 23:21 "jokerchic36" was rejected (pending since 2024-12-11T23:21:00.670Z). * 15:30 [[gitlab:naomi|@naomi]] was approved. * 15:27 [[gitlab:cobi|@cobi]] was approved. === 2025-03-11 === * 12:42 "mohitvermaxx" was rejected (pending since 2024-12-10T12:40:56.967Z). === 2025-03-10 === * 16:51 [[gitlab:nanona15dobato|@nanona15dobato]] was approved. 
=== 2025-03-09 === * 22:39 [[gitlab:jonkolbert|@jonkolbert]] was approved. * 20:45 [[gitlab:urbanecmtest2|@urbanecmtest2]] was approved. === 2025-03-07 === * 16:54 [[gitlab:hswan|@hswan]] was approved. * 14:42 [[gitlab:atitkov|@atitkov]] was approved. * 00:42 [[gitlab:infrastruktur|@infrastruktur]] was approved. === 2025-03-06 === * 17:21 "johnmann" was rejected (pending since 2024-12-05T17:19:24.995Z). === 2025-03-05 === * 07:33 [[gitlab:monx9494|@monx9494]] was approved. === 2025-03-02 === * 21:21 "paul26" was rejected (pending since 2024-12-01T21:20:19.681Z). === 2025-03-01 === * 19:15 [[gitlab:izno|@izno]] was approved. * 12:45 [[gitlab:nyerho|@nyerho]] was approved. === 2025-02-28 === * 18:27 [[gitlab:chuckonwumelu|@chuckonwumelu]] was approved. * 13:09 "ashwinpraveengo" was rejected (pending since 2024-11-29T13:07:47.240Z). * 00:18 "eduardoaugusto" was rejected (pending since 2024-11-29T00:17:43.372Z). === 2025-02-27 === * 20:39 "volkanurl" was rejected (pending since 2024-11-28T20:37:18.101Z). === 2025-02-24 === * 21:15 [[gitlab:feeglgeef|@feeglgeef]] was approved. * 20:18 [[gitlab:piaanalysis2|@piaanalysis2]] was approved. * 19:06 [[gitlab:dhardy|@dhardy]] was approved. === 2025-02-22 === * 19:27 [[gitlab:owuh|@owuh]] was approved. === 2025-02-19 === * 16:06 [[gitlab:artemkloko|@artemkloko]] was approved. * 13:03 [[gitlab:jgafnea|@jgafnea]] was approved. === 2025-02-17 === * 16:33 [[gitlab:asmartkitten|@asmartkitten]] was approved. === 2025-02-16 === * 19:12 "gaurigupta21" was rejected (pending since 2024-11-17T19:11:07.416Z). === 2025-02-15 === * 01:18 [[gitlab:mediawiki-quickstart-ci|@mediawiki-quickstart-ci]] was approved. === 2025-02-14 === * 15:21 "nathanbnm" was rejected (pending since 2024-11-15T15:18:19.632Z). === 2025-02-13 === * 16:45 [[gitlab:priyanshuchahal|@priyanshuchahal]] was approved. * 16:42 [[gitlab:ajhalili2006|@ajhalili2006]] was approved. === 2025-02-12 === * 23:21 "monkeypatch999" was rejected (pending since 2024-11-13T23:20:38.398Z). 
* 06:36 [[gitlab:jainlakshita28|@jainlakshita28]] was approved. === 2025-02-11 === * 19:27 [[gitlab:matthewsm2|@matthewsm2]] was approved. === 2025-02-09 === * 16:15 "mohammed_abukhadra" was rejected (pending since 2024-11-10T16:15:18.361Z). === 2025-02-07 === * 21:33 "brennan" was rejected (pending since 2024-11-08T21:31:07.351Z). === 2025-02-06 === * 08:24 "mmta" was rejected (pending since 2024-11-07T08:22:36.724Z). * 06:21 [[gitlab:bunnypranav|@bunnypranav]] was approved. === 2025-02-05 === * 22:39 "chrissteinchen" was rejected (pending since 2024-11-06T22:38:16.673Z). === 2025-02-03 === * 07:45 "edriiic" was rejected (pending since 2024-11-04T07:44:46.849Z). * 01:12 "geppy" was rejected (pending since 2024-11-04T01:10:48.710Z). === 2025-02-02 === * 13:18 "funa-enpitu" was rejected (pending since 2024-11-03T13:15:46.065Z). === 2025-01-31 === * 23:42 "nfontes" was rejected (pending since 2024-11-01T23:39:41.755Z). * 22:51 "sbronson" was rejected (pending since 2024-11-01T22:50:31.871Z). * 00:42 [[gitlab:farid|@farid]] was approved. === 2025-01-27 === * 08:15 [[gitlab:eliza189|@eliza189]] was approved. === 2025-01-25 === * 09:51 [[gitlab:pamputt|@pamputt]] was approved. === 2025-01-23 === * 14:30 [[gitlab:lubianat|@lubianat]] was approved. * 11:45 [[gitlab:bootsa|@bootsa]] was approved. === 2025-01-21 === * 05:09 "niko" was rejected (pending since 2024-07-21T16:10:01.377Z). * 05:09 "thawizkid369777" was rejected (pending since 2024-07-18T17:42:44.493Z). * 05:09 "sarthaksingh2" was rejected (pending since 2024-07-10T11:31:30.470Z). * 05:09 "shriyakt" was rejected (pending since 2024-07-06T04:54:10.248Z). * 05:09 "akshaya" was rejected (pending since 2024-07-06T04:04:51.488Z). * 05:09 "alaka03aj" was rejected (pending since 2024-07-05T18:01:54.876Z). * 05:09 "sulochanaviji-5049" was rejected (pending since 2024-07-01T05:58:00.427Z). * 05:09 "nayanjnath" was rejected (pending since 2024-07-01T02:51:57.405Z). 
* 05:09 "sd44" was rejected (pending since 2024-06-30T04:28:51.436Z). * 05:09 "metavalent" was rejected (pending since 2024-06-29T01:37:14.210Z). * 05:09 "wicloudx" was rejected (pending since 2024-06-28T11:51:23.335Z). * 05:09 "debo" was rejected (pending since 2024-06-28T01:44:59.845Z). * 05:09 "bwiki" was rejected (pending since 2024-06-23T14:15:38.032Z). * 05:09 "toprak" was rejected (pending since 2024-06-23T11:35:50.819Z). * 05:09 "iristeller" was rejected (pending since 2024-06-14T20:53:48.959Z). * 05:09 "jcolvin" was rejected (pending since 2024-06-12T17:29:01.238Z). * 05:09 "kalyan" was rejected (pending since 2024-06-07T07:52:46.993Z). * 05:09 "bluecrystal" was rejected (pending since 2024-06-06T19:16:20.107Z). * 05:09 "iftttrohit" was rejected (pending since 2024-06-04T12:08:50.818Z). * 05:09 "pogpotato" was rejected (pending since 2024-06-03T17:58:21.684Z). * 05:09 "cptlausebaer" was rejected (pending since 2024-05-31T18:53:27.692Z). * 05:09 "hdevine825" was rejected (pending since 2024-05-31T17:04:18.279Z). * 05:09 "anaghaa18" was rejected (pending since 2024-05-25T19:14:31.803Z). * 05:09 "atharvanair04" was rejected (pending since 2024-05-25T14:24:52.825Z). * 05:09 "anasvemmully" was rejected (pending since 2024-05-25T06:10:27.261Z). * 05:09 "abhinavmohandas" was rejected (pending since 2024-05-25T06:05:24.825Z). * 05:09 "kksurendran06" was rejected (pending since 2024-05-25T06:04:38.082Z). * 05:09 "albertmarshall8896" was rejected (pending since 2024-05-23T09:32:05.462Z). * 05:09 "akellison" was rejected (pending since 2024-05-17T02:07:24.229Z). * 05:09 "mainowill" was rejected (pending since 2024-04-16T23:30:33.881Z). * 05:09 "bzhqc" was rejected (pending since 2024-04-16T19:50:38.676Z). * 05:09 "safan41" was rejected (pending since 2024-04-16T03:34:48.942Z). * 05:09 "mgagat" was rejected (pending since 2024-04-16T03:21:51.764Z). * 05:09 "okeamah" was rejected (pending since 2024-04-16T02:49:00.143Z). 
* 05:09 "xuhao61" was rejected (pending since 2024-04-15T23:45:09.083Z). * 04:47 "cybel" was rejected (pending since 2024-04-15T06:46:35.791Z). === 2025-01-20 === * 14:33 [[gitlab:your1|@your1]] was approved. === 2025-01-18 === * 10:09 [[gitlab:galrach600|@galrach600]] was approved. * 02:51 [[gitlab:blankeclair|@blankeclair]] was approved. === 2025-01-17 === * 13:57 [[gitlab:dsantamaria|@dsantamaria]] was approved. === 2025-01-15 === * 17:12 [[gitlab:smartse|@smartse]] was approved. === 2025-01-14 === * 17:03 [[gitlab:naorleizer|@naorleizer]] was approved. === 2025-01-13 === * 02:45 [[gitlab:wolf20482|@wolf20482]] was approved. === 2025-01-12 === * 17:45 [[gitlab:tamzin|@tamzin]] was approved. === 2025-01-11 === * 15:24 [[gitlab:bargioni|@bargioni]] was approved. * 14:30 [[gitlab:salelya|@salelya]] was approved. * 10:15 [[gitlab:malakatshy|@malakatshy]] was approved. * 05:21 [[gitlab:newmcpee|@newmcpee]] was approved. === 2025-01-09 === * 15:30 [[gitlab:gkyziridis|@gkyziridis]] was approved. === 2025-01-08 === * 16:21 [[gitlab:ukrface|@ukrface]] was approved. === 2024-12-28 === * 03:27 [[gitlab:twonum|@twonum]] was approved. === 2024-12-25 === * 06:09 [[gitlab:harsv567|@harsv567]] was approved. === 2024-12-21 === * 11:24 [[gitlab:amutha2002|@amutha2002]] was approved. === 2024-12-20 === * 19:51 [[gitlab:hridyeshgupta|@hridyeshgupta]] was approved. * 10:00 [[gitlab:ro-shines|@ro-shines]] was approved. * 08:09 [[gitlab:kesharwaniarpita|@kesharwaniarpita]] was approved. === 2024-12-18 === * 14:45 [[gitlab:soylacarli|@soylacarli]] was approved. === 2024-12-16 === * 20:33 [[gitlab:aleyasiddika1|@aleyasiddika1]] was approved. === 2024-12-15 === * 07:33 [[gitlab:abhishek02bhardwaj|@abhishek02bhardwaj]] was approved. === 2024-12-13 === * 13:18 [[gitlab:ashmitabathre204|@ashmitabathre204]] was approved. === 2024-12-10 === * 06:39 [[gitlab:ginaan|@ginaan]] was approved. === 2024-12-09 === * 05:45 [[gitlab:kallinavya|@kallinavya]] was approved. 
* 00:54 [[gitlab:viserion-7|@viserion-7]] was approved. === 2024-12-08 === * 17:27 [[gitlab:wargo|@wargo]] was approved. === 2024-12-05 === * 11:15 [[gitlab:ranjithraj|@ranjithraj]] was approved. === 2024-12-02 === * 21:21 [[gitlab:a930913|@a930913]] was approved. === 2024-12-01 === * 02:39 [[gitlab:kingchristlike1|@kingchristlike1]] was approved. === 2024-11-21 === * 13:45 [[gitlab:sascha|@sascha]] was approved. === 2024-11-19 === * 16:36 [[gitlab:jly|@jly]] was approved. === 2024-11-15 === * 02:54 [[gitlab:danielyepezgarces|@danielyepezgarces]] was approved. === 2024-11-14 === * 14:15 [[gitlab:stimoroll|@stimoroll]] was approved. === 2024-11-09 === * 17:15 [[gitlab:f4udeveloper|@f4udeveloper]] was approved. === 2024-11-07 === * 19:15 [[gitlab:zulf|@zulf]] was approved. * 05:33 [[gitlab:hassanamin|@hassanamin]] was approved. === 2024-11-06 === * 19:39 [[gitlab:daniuu|@daniuu]] was approved. * 00:18 [[gitlab:rlopez-wmf|@rlopez-wmf]] was approved. === 2024-10-09 === * 14:45 [[gitlab:jtweed|@jtweed]] was approved. * 10:24 [[gitlab:ifrahkh|@ifrahkh]] was approved. * 09:06 [[gitlab:wikibayer|@wikibayer]] was approved. === 2024-10-06 === * 10:27 [[gitlab:keerthan16|@keerthan16]] was approved. === 2024-10-04 === * 07:45 [[gitlab:hakimi97|@hakimi97]] was approved. === 2024-09-30 === * 07:39 [[gitlab:ninjastrikers|@ninjastrikers]] was approved. === 2024-09-28 === * 17:30 [[gitlab:webrunner95|@webrunner95]] was approved. === 2024-09-18 === * 21:39 [[gitlab:elliottetzkorn|@elliottetzkorn]] was approved. === 2024-09-14 === * 22:06 [[gitlab:humptydumpty|@humptydumpty]] was approved. === 2024-09-06 === * 08:48 [[gitlab:mickabarber|@mickabarber]] was approved. === 2024-08-27 === * 17:36 [[gitlab:edgars|@edgars]] was approved. === 2024-08-22 === * 09:18 [[gitlab:antonkokhwmde|@antonkokhwmde]] was approved. === 2024-08-14 === * 19:21 [[gitlab:jfk|@jfk]] was approved. === 2024-08-13 === * 17:57 [[gitlab:daxserver|@daxserver]] was approved. 
=== 2024-08-11 === * 09:57 [[gitlab:pauliesnug|@pauliesnug]] was approved. === 2024-08-10 === * 08:42 [[gitlab:ashig|@ashig]] was approved. === 2024-08-09 === * 14:09 [[gitlab:masssly|@masssly]] was approved. === 2024-08-05 === * 22:15 [[gitlab:mrtortue|@mrtortue]] was approved. === 2024-08-02 === * 16:21 [[gitlab:dsantini|@dsantini]] was approved. === 2024-07-31 === * 11:54 [[gitlab:cptviraj|@cptviraj]] was approved. === 2024-07-30 === * 19:09 [[gitlab:iniquity|@iniquity]] was approved. * 10:00 [[gitlab:collins|@collins]] was approved. === 2024-07-27 === * 15:57 [[gitlab:songnguxyz|@songnguxyz]] was approved. === 2024-07-25 === * 12:36 [[gitlab:mszabo|@mszabo]] was approved. * 09:21 [[gitlab:agarwalmahima|@agarwalmahima]] was approved. === 2024-07-24 === * 08:05 [[gitlab:dragoniez|@dragoniez]] was approved. === 2024-07-23 === * 06:54 [[gitlab:mirji|@mirji]] was approved. === 2024-07-16 === * 10:00 [[gitlab:lakejason0|@lakejason0]] was approved. === 2024-07-12 === * 11:33 [[gitlab:cn|@cn]] was approved. * 08:12 [[gitlab:unchampignon|@unchampignon]] was approved. === 2024-07-07 === * 17:12 [[gitlab:agamyasamuel|@agamyasamuel]] was approved. * 05:24 [[gitlab:kuldeepburjbhalaike|@kuldeepburjbhalaike]] was approved. === 2024-07-06 === * 11:18 [[gitlab:dibya|@dibya]] was approved. * 04:54 [[gitlab:sarthakparashar|@sarthakparashar]] was approved. === 2024-07-05 === * 18:15 [[gitlab:vanshikarathi|@vanshikarathi]] was approved. === 2024-07-02 === * 19:00 [[gitlab:ebrahim|@ebrahim]] was approved. === 2024-07-01 === * 20:12 [[gitlab:rockingpenny4|@rockingpenny4]] was approved. * 18:15 [[gitlab:balajijagadesh|@balajijagadesh]] was approved. === 2024-06-30 === * 18:24 [[gitlab:hrideshmg|@hrideshmg]] was approved. * 07:18 [[gitlab:chanakyakumardas|@chanakyakumardas]] was approved. * 06:30 [[gitlab:rihaan180|@rihaan180]] was approved. === 2024-06-27 === * 17:36 [[gitlab:driedmueller|@driedmueller]] was approved. 
=== 2024-06-19 === * 12:57 [[gitlab:audreypenven|@audreypenven]] was approved. === 2024-06-16 === * 01:18 [[gitlab:roysmith|@roysmith]] was approved. === 2024-06-08 === * 02:45 [[gitlab:jleedev|@jleedev]] was approved. === 2024-06-03 === * 13:57 [[gitlab:afeder|@afeder]] was approved. === 2024-06-01 === * 10:54 [[gitlab:florianschmitt|@florianschmitt]] was approved. === 2024-05-30 === * 16:42 [[gitlab:krlsca|@krlsca]] was approved. === 2024-05-28 === * 11:24 [[gitlab:rickijay|@rickijay]] was approved. === 2024-05-26 === * 11:18 [[gitlab:ranjithsiji|@ranjithsiji]] was approved. === 2024-05-25 === * 07:24 [[gitlab:jony|@jony]] was approved. === 2024-05-23 === * 08:45 [[gitlab:lepticed7|@lepticed7]] was approved. === 2024-05-22 === * 20:42 [[gitlab:echecs|@echecs]] was approved. === 2024-05-21 === * 13:33 [[gitlab:mbs|@mbs]] was approved. === 2024-05-19 === * 18:06 [[gitlab:ionenlaser|@ionenlaser]] was approved. === 2024-05-18 === * 23:36 [[gitlab:mdaniels5757|@mdaniels5757]] was approved. === 2024-05-17 === * 08:54 [[gitlab:grapedog|@grapedog]] was approved. === 2024-05-08 === * 19:42 [[gitlab:kelhurd|@kelhurd]] was approved. * 19:06 [[gitlab:khurd|@khurd]] was approved. === 2024-05-06 === * 19:48 [[gitlab:j3j5|@j3j5]] was approved. * 12:06 [[gitlab:tk-999|@tk-999]] was approved. === 2024-05-05 === * 22:09 [[gitlab:pppery|@pppery]] was approved. * 20:33 [[gitlab:sakretsu|@sakretsu]] was approved. * 12:12 [[gitlab:waterquark|@waterquark]] was approved. === 2024-05-04 === * 09:03 [[gitlab:multichill|@multichill]] was approved. * 07:42 [[gitlab:abaris|@abaris]] was approved. === 2024-05-03 === * 14:57 [[gitlab:maurusian|@maurusian]] was approved. === 2024-04-24 === * 05:48 [[gitlab:wolfinux|@wolfinux]] was approved. === 2024-04-23 === * 15:48 [[gitlab:dreamrimmer|@dreamrimmer]] was approved. === 2024-04-21 === * 06:51 [[gitlab:alon|@alon]] was approved. === 2024-04-17 === * 23:33 [[gitlab:derenrich|@derenrich]] was approved. 
=== 2024-04-16 === * 17:18 [[gitlab:valcio|@valcio]] was approved. === 2024-04-14 === * 16:51 [[gitlab:wikilucas00|@wikilucas00]] was approved. === 2024-04-06 === * 12:48 [[gitlab:theprotonade|@theprotonade]] was approved. === 2024-04-02 === * 07:30 [[gitlab:bohuizhang|@bohuizhang]] was approved. === 2024-03-30 === * 13:36 [[gitlab:lpintscher|@lpintscher]] was approved. === 2024-03-26 === * 17:09 [[gitlab:eenabulele|@eenabulele]] was approved. === 2024-03-25 === * 14:27 [[gitlab:tuukka|@tuukka]] was approved. === 2024-03-24 === * 12:24 [[gitlab:firefly|@firefly]] was approved. === 2024-03-21 === * 19:33 [[gitlab:universal-omega|@universal-omega]] was approved. === 2024-03-17 === * 10:36 [[gitlab:bisel91|@bisel91]] was approved. === 2024-03-16 === * 10:09 [[gitlab:delord|@delord]] was approved. * 00:42 [[gitlab:athulvis1|@athulvis1]] was approved. === 2024-03-15 === * 19:06 [[gitlab:ignaciorodrguez|@ignaciorodrguez]] was approved. * 08:30 [[gitlab:peachey88|@peachey88]] was approved. * 06:51 [[gitlab:derick|@derick]] was approved. === 2024-03-12 === * 15:06 [[gitlab:xiaoxiao|@xiaoxiao]] was approved. === 2024-03-06 === * 13:21 [[gitlab:desianabae1|@desianabae1]] was approved. === 2024-03-05 === * 19:21 [[gitlab:ep1c|@ep1c]] was approved. * 16:33 [[gitlab:jasmine|@jasmine]] was approved. === 2024-03-02 === * 06:42 [[gitlab:potsdamlamb|@potsdamlamb]] was approved. === 2024-02-29 === * 23:18 [[gitlab:arandomname123|@arandomname123]] was approved. * 18:03 [[gitlab:baba|@baba]] was approved. * 17:48 [[gitlab:yfdyh000|@yfdyh000]] was approved. * 03:09 [[gitlab:sds|@sds]] was approved. === 2024-02-27 === * 23:33 [[gitlab:lofhi|@lofhi]] was approved. === 2024-02-15 === * 19:45 [[gitlab:gergesshamon|@gergesshamon]] was approved. === 2024-02-14 === * 14:33 [[gitlab:philipnelson99|@philipnelson99]] was approved. === 2024-02-13 === * 13:06 [[gitlab:dringsim|@dringsim]] was approved. === 2024-02-12 === * 17:36 [[gitlab:haak|@haak]] was approved. 
=== 2024-02-05 === * 17:33 [[gitlab:qwerfjkl|@qwerfjkl]] was approved. * 17:14 [[gitlab:ahecht|@ahecht]] was approved. === 2024-02-01 === * 09:27 [[gitlab:arinaigum|@arinaigum]] was approved. * 00:15 [[gitlab:jas42|@jas42]] was approved. * 00:15 [[gitlab:edhu|@edhu]] was approved. * 00:15 [[gitlab:marnanel|@marnanel]] was approved. * 00:15 [[gitlab:ibrahemqasim|@ibrahemqasim]] was approved. * 00:15 [[gitlab:amasotti|@amasotti]] was approved. * 00:15 [[gitlab:deni|@deni]] was approved. * 00:15 [[gitlab:cyber|@cyber]] was approved. * 00:15 [[gitlab:saroj|@saroj]] was approved. === 2024-01-29 === * 21:42 [[gitlab:rgupta|@rgupta]] was approved. === 2024-01-07 === * 09:48 [[gitlab:lutrome|@lutrome]] was approved. === 2024-01-05 === * 20:48 [[gitlab:jinoytommanjaly|@jinoytommanjaly]] was approved. * 02:51 [[gitlab:braunobruno|@braunobruno]] was approved. * 01:08 [[gitlab:amorymeltzer|@amorymeltzer]] was approved. * 01:08 [[gitlab:phi22ipus|@phi22ipus]] was approved. === 2024-01-03 === * 14:45 [[gitlab:gabina|@gabina]] was approved. === 2024-01-02 === * 13:18 [[gitlab:arthurtaylor|@arthurtaylor]] was approved. === 2023-12-23 === * 00:33 [[gitlab:aram|@aram]] was approved. === 2023-12-22 === * 16:24 [[gitlab:elpitareio|@elpitareio]] was approved. === 2023-12-21 === * 00:43 [[gitlab:bsadowski1|@bsadowski1]] was approved. * 00:43 [[gitlab:ederporto|@ederporto]] was approved. * 00:43 [[gitlab:sadraiiali|@sadraiiali]] was approved. * 00:43 [[gitlab:wasp-outis|@wasp-outis]] was approved. * 00:43 [[gitlab:bodhisattwa|@bodhisattwa]] was approved. * 00:43 [[gitlab:air7538|@air7538]] was approved. * 00:43 [[gitlab:anzx|@anzx]] was approved. * 00:43 [[gitlab:tekask1903|@tekask1903]] was approved. * 00:42 [[gitlab:kiwi-0x010c|@kiwi-0x010c]] was approved. * 00:42 [[gitlab:mpaa|@mpaa]] was approved. * 00:42 [[gitlab:kutay|@kutay]] was approved. * 00:42 [[gitlab:wattmto|@wattmto]] was approved. 
ilrzp5pke9ukm7vs06if4airhjzwiuj
2309679 2309654 2025-06-09T09:33:20Z Gitlabaccountapprovalbot 37332 @mmta was approved. 2309679 wikitext text/x-wiki
<noinclude>'''Audit log of approvals''' made by [[gitlab:gitlabaccountapprovalbot|@gitlabaccountapprovalbot]]. __NOTOC__</noinclude> === 2025-06-09 === * 09:33 [[gitlab:mmta|@mmta]] was approved. * 08:03 "a-ssh22" was rejected (pending since 2025-03-10T08:03:08.111Z). === 2025-06-08 === * 21:06 "mm-episodenlistedlvaupdater" was rejected (pending since 2025-03-09T21:04:06.323Z). === 2025-06-06 === * 11:06 [[gitlab:olea|@olea]] was approved. === 2025-06-05 === * 20:33 [[gitlab:encodedwp|@encodedwp]] was approved. * 15:00 [[gitlab:toluayo|@toluayo]] was approved. * 13:51 [[gitlab:arnold_lup|@arnold_lup]] was approved. * 11:54 "sdhehua" was rejected (pending since 2025-03-06T11:51:48.241Z). === 2025-06-03 === * 21:27 [[gitlab:wewakey|@wewakey]] was approved. * 12:36 "hunsimon2" was rejected (pending since 2025-03-04T12:34:56.520Z). * 11:54 "hunsimon" was rejected (pending since 2025-03-04T11:53:54.652Z). === 2025-06-02 === * 12:01 [[gitlab:jaimedes|@jaimedes]] was approved. === 2025-05-30 === * 18:00 "sathvik9105" was rejected (pending since 2025-02-28T17:59:42.867Z). * 11:21 [[gitlab:tonythomas01|@tonythomas01]] was approved. * 10:06 [[gitlab:gpsleo|@gpsleo]] was approved. === 2025-05-29 === * 22:12 [[gitlab:codynguyen1116|@codynguyen1116]] was approved. === 2025-05-28 === * 02:57 [[gitlab:saper|@saper]] was approved. === 2025-05-27 === * 21:06 [[gitlab:mohammed_qays|@mohammed_qays]] was approved. * 15:33 "satanluimm" was rejected (pending since 2025-02-25T15:32:48.101Z). === 2025-05-26 === * 23:57 "seyedali220" was rejected (pending since 2025-02-24T23:56:17.621Z). === 2025-05-21 === * 11:12 [[gitlab:guilherme|@guilherme]] was approved. === 2025-05-19 === * 13:24 [[gitlab:emojiwiki|@emojiwiki]] was approved. === 2025-05-18 === * 00:00 "xidme" was rejected (pending since 2025-02-15T23:58:56.796Z).
=== 2025-05-17 === * 02:39 "kdh8219" was rejected (pending since 2025-02-15T02:36:32.237Z). === 2025-05-16 === * 15:09 [[gitlab:maxbinderwmf|@maxbinderwmf]] was approved. === 2025-05-15 === * 04:30 "inspectorzer0" was rejected (pending since 2025-02-13T04:27:33.179Z). === 2025-05-14 === * 17:42 [[gitlab:llugo|@llugo]] was approved. === 2025-05-13 === * 20:18 "mmta" was rejected (pending since 2025-02-11T20:17:23.407Z). === 2025-05-11 === * 20:51 "jad" was rejected (pending since 2025-02-09T20:49:07.333Z). * 17:54 "nishchalsundan" was rejected (pending since 2025-02-09T17:52:25.761Z). * 16:39 "mohammed_abukhadra" was rejected (pending since 2025-02-09T16:39:03.730Z). === 2025-05-09 === * 09:12 [[gitlab:sirchanmp|@sirchanmp]] was approved. === 2025-05-08 === * 08:18 [[gitlab:mengeditch|@mengeditch]] was approved. === 2025-05-07 === * 03:45 "xluffy" was rejected (pending since 2025-02-05T03:45:14.181Z). === 2025-05-06 === * 16:54 "punhaniabhishek" was rejected (pending since 2025-02-04T16:53:50.758Z). * 09:36 [[gitlab:bmartinezcalvo|@bmartinezcalvo]] was approved. === 2025-05-02 === * 12:24 [[gitlab:tohaomg|@tohaomg]] was approved. * 11:48 [[gitlab:mavrikant|@mavrikant]] was approved. * 11:45 [[gitlab:daanvr|@daanvr]] was approved. === 2025-05-01 === * 09:09 "mjoerg" was rejected (pending since 2025-01-30T09:09:04.204Z). === 2025-04-30 === * 23:06 "sanskardubey" was rejected (pending since 2025-01-29T23:03:25.489Z). === 2025-04-29 === * 16:00 "geyslein" was rejected (pending since 2025-01-28T16:00:01.510Z). === 2025-04-26 === * 09:30 "anjali9027" was rejected (pending since 2025-01-25T09:28:07.064Z). === 2025-04-25 === * 18:00 "salahhazaa" was rejected (pending since 2025-01-24T17:58:30.030Z). * 15:15 [[gitlab:yiming|@yiming]] was approved. * 02:06 "mrchanmp" was rejected (pending since 2025-01-24T02:03:58.308Z). === 2025-04-23 === * 17:03 "rj2904" was rejected (pending since 2025-01-22T17:03:11.207Z). 
* 14:21 "nischay33" was rejected (pending since 2025-01-22T14:19:21.081Z). === 2025-04-22 === * 19:27 "dj80" was rejected (pending since 2025-01-21T19:25:28.498Z). * 14:30 [[gitlab:kaimamin|@kaimamin]] was approved. * 09:57 "debo" was rejected (pending since 2025-01-21T09:54:47.955Z). === 2025-04-21 === * 12:24 "unshell" was rejected (pending since 2025-01-20T12:21:59.686Z). === 2025-04-18 === * 15:06 [[gitlab:spartanarbinger|@spartanarbinger]] was approved. === 2025-04-16 === * 03:09 "dewey" was rejected (pending since 2025-01-15T03:06:17.488Z). === 2025-04-15 === * 19:45 "emdadul" was rejected (pending since 2025-01-14T19:42:29.285Z). === 2025-04-14 === * 06:45 [[gitlab:bcampbell804|@bcampbell804]] was approved. === 2025-04-11 === * 06:27 [[gitlab:jvanderhoop|@jvanderhoop]] was approved. === 2025-04-10 === * 04:12 "bhai420" was rejected (pending since 2025-01-09T04:10:29.430Z). === 2025-04-09 === * 05:03 "austinvarshney" was rejected (pending since 2025-01-08T05:02:34.175Z). === 2025-04-06 === * 15:36 [[gitlab:elph|@elph]] was approved. === 2025-04-02 === * 10:33 [[gitlab:ozge|@ozge]] was approved. === 2025-03-31 === * 20:15 "demandkey" was rejected (pending since 2024-12-30T20:14:23.096Z). * 15:18 [[gitlab:danyya|@danyya]] was approved. === 2025-03-28 === * 15:54 [[gitlab:rutsavi09|@rutsavi09]] was approved. * 15:54 [[gitlab:ilanen1|@ilanen1]] was approved. === 2025-03-25 === * 19:27 [[gitlab:irfo|@irfo]] was approved. * 11:54 [[gitlab:kmontalva-wmf|@kmontalva-wmf]] was approved. * 04:33 [[gitlab:paul26|@paul26]] was approved. * 04:18 "as1100k" was rejected (pending since 2024-12-24T04:18:06.813Z). === 2025-03-24 === * 11:33 "amzadkhankk" was rejected (pending since 2024-12-23T11:33:14.176Z). === 2025-03-23 === * 12:24 "wolfdo" was rejected (pending since 2024-12-22T12:23:35.056Z). === 2025-03-22 === * 09:45 [[gitlab:fjmustak|@fjmustak]] was approved. === 2025-03-20 === * 18:42 "sathishkokila" was rejected (pending since 2024-12-19T18:39:35.161Z). 
* 17:03 [[gitlab:alien4444|@alien4444]] was approved. * 15:27 [[gitlab:davidcoronel|@davidcoronel]] was approved. === 2025-03-19 === * 22:57 [[gitlab:r1f4t|@r1f4t]] was approved. * 19:03 "daniel24ps" was rejected (pending since 2024-12-18T19:00:21.249Z). * 14:18 [[gitlab:beepbooppenguin|@beepbooppenguin]] was approved. === 2025-03-18 === * 17:48 "rahulkundu1209" was rejected (pending since 2024-12-17T17:46:41.936Z). * 08:15 "kirtisikka972" was rejected (pending since 2024-12-17T08:13:25.487Z). === 2025-03-15 === * 13:30 "tulspal_sidhu" was rejected (pending since 2024-12-14T13:29:10.606Z). * 01:39 "peacedeadc" was rejected (pending since 2024-12-14T01:37:36.579Z). === 2025-03-14 === * 03:51 [[gitlab:chuckthebuck|@chuckthebuck]] was approved. * 02:33 "yxngtrtxll" was rejected (pending since 2024-12-13T02:31:51.658Z). === 2025-03-13 === * 14:36 [[gitlab:iccander|@iccander]] was approved. === 2025-03-12 === * 23:21 "jokerchic36" was rejected (pending since 2024-12-11T23:21:00.670Z). * 15:30 [[gitlab:naomi|@naomi]] was approved. * 15:27 [[gitlab:cobi|@cobi]] was approved. === 2025-03-11 === * 12:42 "mohitvermaxx" was rejected (pending since 2024-12-10T12:40:56.967Z). === 2025-03-10 === * 16:51 [[gitlab:nanona15dobato|@nanona15dobato]] was approved. === 2025-03-09 === * 22:39 [[gitlab:jonkolbert|@jonkolbert]] was approved. * 20:45 [[gitlab:urbanecmtest2|@urbanecmtest2]] was approved. === 2025-03-07 === * 16:54 [[gitlab:hswan|@hswan]] was approved. * 14:42 [[gitlab:atitkov|@atitkov]] was approved. * 00:42 [[gitlab:infrastruktur|@infrastruktur]] was approved. === 2025-03-06 === * 17:21 "johnmann" was rejected (pending since 2024-12-05T17:19:24.995Z). === 2025-03-05 === * 07:33 [[gitlab:monx9494|@monx9494]] was approved. === 2025-03-02 === * 21:21 "paul26" was rejected (pending since 2024-12-01T21:20:19.681Z). === 2025-03-01 === * 19:15 [[gitlab:izno|@izno]] was approved. * 12:45 [[gitlab:nyerho|@nyerho]] was approved. 
=== 2025-02-28 === * 18:27 [[gitlab:chuckonwumelu|@chuckonwumelu]] was approved. * 13:09 "ashwinpraveengo" was rejected (pending since 2024-11-29T13:07:47.240Z). * 00:18 "eduardoaugusto" was rejected (pending since 2024-11-29T00:17:43.372Z). === 2025-02-27 === * 20:39 "volkanurl" was rejected (pending since 2024-11-28T20:37:18.101Z). === 2025-02-24 === * 21:15 [[gitlab:feeglgeef|@feeglgeef]] was approved. * 20:18 [[gitlab:piaanalysis2|@piaanalysis2]] was approved. * 19:06 [[gitlab:dhardy|@dhardy]] was approved. === 2025-02-22 === * 19:27 [[gitlab:owuh|@owuh]] was approved. === 2025-02-19 === * 16:06 [[gitlab:artemkloko|@artemkloko]] was approved. * 13:03 [[gitlab:jgafnea|@jgafnea]] was approved. === 2025-02-17 === * 16:33 [[gitlab:asmartkitten|@asmartkitten]] was approved. === 2025-02-16 === * 19:12 "gaurigupta21" was rejected (pending since 2024-11-17T19:11:07.416Z). === 2025-02-15 === * 01:18 [[gitlab:mediawiki-quickstart-ci|@mediawiki-quickstart-ci]] was approved. === 2025-02-14 === * 15:21 "nathanbnm" was rejected (pending since 2024-11-15T15:18:19.632Z). === 2025-02-13 === * 16:45 [[gitlab:priyanshuchahal|@priyanshuchahal]] was approved. * 16:42 [[gitlab:ajhalili2006|@ajhalili2006]] was approved. === 2025-02-12 === * 23:21 "monkeypatch999" was rejected (pending since 2024-11-13T23:20:38.398Z). * 06:36 [[gitlab:jainlakshita28|@jainlakshita28]] was approved. === 2025-02-11 === * 19:27 [[gitlab:matthewsm2|@matthewsm2]] was approved. === 2025-02-09 === * 16:15 "mohammed_abukhadra" was rejected (pending since 2024-11-10T16:15:18.361Z). === 2025-02-07 === * 21:33 "brennan" was rejected (pending since 2024-11-08T21:31:07.351Z). === 2025-02-06 === * 08:24 "mmta" was rejected (pending since 2024-11-07T08:22:36.724Z). * 06:21 [[gitlab:bunnypranav|@bunnypranav]] was approved. === 2025-02-05 === * 22:39 "chrissteinchen" was rejected (pending since 2024-11-06T22:38:16.673Z). === 2025-02-03 === * 07:45 "edriiic" was rejected (pending since 2024-11-04T07:44:46.849Z). 
* 01:12 "geppy" was rejected (pending since 2024-11-04T01:10:48.710Z). === 2025-02-02 === * 13:18 "funa-enpitu" was rejected (pending since 2024-11-03T13:15:46.065Z). === 2025-01-31 === * 23:42 "nfontes" was rejected (pending since 2024-11-01T23:39:41.755Z). * 22:51 "sbronson" was rejected (pending since 2024-11-01T22:50:31.871Z). * 00:42 [[gitlab:farid|@farid]] was approved. === 2025-01-27 === * 08:15 [[gitlab:eliza189|@eliza189]] was approved. === 2025-01-25 === * 09:51 [[gitlab:pamputt|@pamputt]] was approved. === 2025-01-23 === * 14:30 [[gitlab:lubianat|@lubianat]] was approved. * 11:45 [[gitlab:bootsa|@bootsa]] was approved. === 2025-01-21 === * 05:09 "niko" was rejected (pending since 2024-07-21T16:10:01.377Z). * 05:09 "thawizkid369777" was rejected (pending since 2024-07-18T17:42:44.493Z). * 05:09 "sarthaksingh2" was rejected (pending since 2024-07-10T11:31:30.470Z). * 05:09 "shriyakt" was rejected (pending since 2024-07-06T04:54:10.248Z). * 05:09 "akshaya" was rejected (pending since 2024-07-06T04:04:51.488Z). * 05:09 "alaka03aj" was rejected (pending since 2024-07-05T18:01:54.876Z). * 05:09 "sulochanaviji-5049" was rejected (pending since 2024-07-01T05:58:00.427Z). * 05:09 "nayanjnath" was rejected (pending since 2024-07-01T02:51:57.405Z). * 05:09 "sd44" was rejected (pending since 2024-06-30T04:28:51.436Z). * 05:09 "metavalent" was rejected (pending since 2024-06-29T01:37:14.210Z). * 05:09 "wicloudx" was rejected (pending since 2024-06-28T11:51:23.335Z). * 05:09 "debo" was rejected (pending since 2024-06-28T01:44:59.845Z). * 05:09 "bwiki" was rejected (pending since 2024-06-23T14:15:38.032Z). * 05:09 "toprak" was rejected (pending since 2024-06-23T11:35:50.819Z). * 05:09 "iristeller" was rejected (pending since 2024-06-14T20:53:48.959Z). * 05:09 "jcolvin" was rejected (pending since 2024-06-12T17:29:01.238Z). * 05:09 "kalyan" was rejected (pending since 2024-06-07T07:52:46.993Z). * 05:09 "bluecrystal" was rejected (pending since 2024-06-06T19:16:20.107Z). 
* 05:09 "iftttrohit" was rejected (pending since 2024-06-04T12:08:50.818Z). * 05:09 "pogpotato" was rejected (pending since 2024-06-03T17:58:21.684Z). * 05:09 "cptlausebaer" was rejected (pending since 2024-05-31T18:53:27.692Z). * 05:09 "hdevine825" was rejected (pending since 2024-05-31T17:04:18.279Z). * 05:09 "anaghaa18" was rejected (pending since 2024-05-25T19:14:31.803Z). * 05:09 "atharvanair04" was rejected (pending since 2024-05-25T14:24:52.825Z). * 05:09 "anasvemmully" was rejected (pending since 2024-05-25T06:10:27.261Z). * 05:09 "abhinavmohandas" was rejected (pending since 2024-05-25T06:05:24.825Z). * 05:09 "kksurendran06" was rejected (pending since 2024-05-25T06:04:38.082Z). * 05:09 "albertmarshall8896" was rejected (pending since 2024-05-23T09:32:05.462Z). * 05:09 "akellison" was rejected (pending since 2024-05-17T02:07:24.229Z). * 05:09 "mainowill" was rejected (pending since 2024-04-16T23:30:33.881Z). * 05:09 "bzhqc" was rejected (pending since 2024-04-16T19:50:38.676Z). * 05:09 "safan41" was rejected (pending since 2024-04-16T03:34:48.942Z). * 05:09 "mgagat" was rejected (pending since 2024-04-16T03:21:51.764Z). * 05:09 "okeamah" was rejected (pending since 2024-04-16T02:49:00.143Z). * 05:09 "xuhao61" was rejected (pending since 2024-04-15T23:45:09.083Z). * 04:47 "cybel" was rejected (pending since 2024-04-15T06:46:35.791Z). === 2025-01-20 === * 14:33 [[gitlab:your1|@your1]] was approved. === 2025-01-18 === * 10:09 [[gitlab:galrach600|@galrach600]] was approved. * 02:51 [[gitlab:blankeclair|@blankeclair]] was approved. === 2025-01-17 === * 13:57 [[gitlab:dsantamaria|@dsantamaria]] was approved. === 2025-01-15 === * 17:12 [[gitlab:smartse|@smartse]] was approved. === 2025-01-14 === * 17:03 [[gitlab:naorleizer|@naorleizer]] was approved. === 2025-01-13 === * 02:45 [[gitlab:wolf20482|@wolf20482]] was approved. === 2025-01-12 === * 17:45 [[gitlab:tamzin|@tamzin]] was approved. === 2025-01-11 === * 15:24 [[gitlab:bargioni|@bargioni]] was approved. 
* 14:30 [[gitlab:salelya|@salelya]] was approved. * 10:15 [[gitlab:malakatshy|@malakatshy]] was approved. * 05:21 [[gitlab:newmcpee|@newmcpee]] was approved. === 2025-01-09 === * 15:30 [[gitlab:gkyziridis|@gkyziridis]] was approved. === 2025-01-08 === * 16:21 [[gitlab:ukrface|@ukrface]] was approved. === 2024-12-28 === * 03:27 [[gitlab:twonum|@twonum]] was approved. === 2024-12-25 === * 06:09 [[gitlab:harsv567|@harsv567]] was approved. === 2024-12-21 === * 11:24 [[gitlab:amutha2002|@amutha2002]] was approved. === 2024-12-20 === * 19:51 [[gitlab:hridyeshgupta|@hridyeshgupta]] was approved. * 10:00 [[gitlab:ro-shines|@ro-shines]] was approved. * 08:09 [[gitlab:kesharwaniarpita|@kesharwaniarpita]] was approved. === 2024-12-18 === * 14:45 [[gitlab:soylacarli|@soylacarli]] was approved. === 2024-12-16 === * 20:33 [[gitlab:aleyasiddika1|@aleyasiddika1]] was approved. === 2024-12-15 === * 07:33 [[gitlab:abhishek02bhardwaj|@abhishek02bhardwaj]] was approved. === 2024-12-13 === * 13:18 [[gitlab:ashmitabathre204|@ashmitabathre204]] was approved. === 2024-12-10 === * 06:39 [[gitlab:ginaan|@ginaan]] was approved. === 2024-12-09 === * 05:45 [[gitlab:kallinavya|@kallinavya]] was approved. * 00:54 [[gitlab:viserion-7|@viserion-7]] was approved. === 2024-12-08 === * 17:27 [[gitlab:wargo|@wargo]] was approved. === 2024-12-05 === * 11:15 [[gitlab:ranjithraj|@ranjithraj]] was approved. === 2024-12-02 === * 21:21 [[gitlab:a930913|@a930913]] was approved. === 2024-12-01 === * 02:39 [[gitlab:kingchristlike1|@kingchristlike1]] was approved. === 2024-11-21 === * 13:45 [[gitlab:sascha|@sascha]] was approved. === 2024-11-19 === * 16:36 [[gitlab:jly|@jly]] was approved. === 2024-11-15 === * 02:54 [[gitlab:danielyepezgarces|@danielyepezgarces]] was approved. === 2024-11-14 === * 14:15 [[gitlab:stimoroll|@stimoroll]] was approved. === 2024-11-09 === * 17:15 [[gitlab:f4udeveloper|@f4udeveloper]] was approved. === 2024-11-07 === * 19:15 [[gitlab:zulf|@zulf]] was approved. 
* 05:33 [[gitlab:hassanamin|@hassanamin]] was approved. === 2024-11-06 === * 19:39 [[gitlab:daniuu|@daniuu]] was approved. * 00:18 [[gitlab:rlopez-wmf|@rlopez-wmf]] was approved. === 2024-10-09 === * 14:45 [[gitlab:jtweed|@jtweed]] was approved. * 10:24 [[gitlab:ifrahkh|@ifrahkh]] was approved. * 09:06 [[gitlab:wikibayer|@wikibayer]] was approved. === 2024-10-06 === * 10:27 [[gitlab:keerthan16|@keerthan16]] was approved. === 2024-10-04 === * 07:45 [[gitlab:hakimi97|@hakimi97]] was approved. === 2024-09-30 === * 07:39 [[gitlab:ninjastrikers|@ninjastrikers]] was approved. === 2024-09-28 === * 17:30 [[gitlab:webrunner95|@webrunner95]] was approved. === 2024-09-18 === * 21:39 [[gitlab:elliottetzkorn|@elliottetzkorn]] was approved. === 2024-09-14 === * 22:06 [[gitlab:humptydumpty|@humptydumpty]] was approved. === 2024-09-06 === * 08:48 [[gitlab:mickabarber|@mickabarber]] was approved. === 2024-08-27 === * 17:36 [[gitlab:edgars|@edgars]] was approved. === 2024-08-22 === * 09:18 [[gitlab:antonkokhwmde|@antonkokhwmde]] was approved. === 2024-08-14 === * 19:21 [[gitlab:jfk|@jfk]] was approved. === 2024-08-13 === * 17:57 [[gitlab:daxserver|@daxserver]] was approved. === 2024-08-11 === * 09:57 [[gitlab:pauliesnug|@pauliesnug]] was approved. === 2024-08-10 === * 08:42 [[gitlab:ashig|@ashig]] was approved. === 2024-08-09 === * 14:09 [[gitlab:masssly|@masssly]] was approved. === 2024-08-05 === * 22:15 [[gitlab:mrtortue|@mrtortue]] was approved. === 2024-08-02 === * 16:21 [[gitlab:dsantini|@dsantini]] was approved. === 2024-07-31 === * 11:54 [[gitlab:cptviraj|@cptviraj]] was approved. === 2024-07-30 === * 19:09 [[gitlab:iniquity|@iniquity]] was approved. * 10:00 [[gitlab:collins|@collins]] was approved. === 2024-07-27 === * 15:57 [[gitlab:songnguxyz|@songnguxyz]] was approved. === 2024-07-25 === * 12:36 [[gitlab:mszabo|@mszabo]] was approved. * 09:21 [[gitlab:agarwalmahima|@agarwalmahima]] was approved. === 2024-07-24 === * 08:05 [[gitlab:dragoniez|@dragoniez]] was approved. 
=== 2024-07-23 === * 06:54 [[gitlab:mirji|@mirji]] was approved. === 2024-07-16 === * 10:00 [[gitlab:lakejason0|@lakejason0]] was approved. === 2024-07-12 === * 11:33 [[gitlab:cn|@cn]] was approved. * 08:12 [[gitlab:unchampignon|@unchampignon]] was approved. === 2024-07-07 === * 17:12 [[gitlab:agamyasamuel|@agamyasamuel]] was approved. * 05:24 [[gitlab:kuldeepburjbhalaike|@kuldeepburjbhalaike]] was approved. === 2024-07-06 === * 11:18 [[gitlab:dibya|@dibya]] was approved. * 04:54 [[gitlab:sarthakparashar|@sarthakparashar]] was approved. === 2024-07-05 === * 18:15 [[gitlab:vanshikarathi|@vanshikarathi]] was approved. === 2024-07-02 === * 19:00 [[gitlab:ebrahim|@ebrahim]] was approved. === 2024-07-01 === * 20:12 [[gitlab:rockingpenny4|@rockingpenny4]] was approved. * 18:15 [[gitlab:balajijagadesh|@balajijagadesh]] was approved. === 2024-06-30 === * 18:24 [[gitlab:hrideshmg|@hrideshmg]] was approved. * 07:18 [[gitlab:chanakyakumardas|@chanakyakumardas]] was approved. * 06:30 [[gitlab:rihaan180|@rihaan180]] was approved. === 2024-06-27 === * 17:36 [[gitlab:driedmueller|@driedmueller]] was approved. === 2024-06-19 === * 12:57 [[gitlab:audreypenven|@audreypenven]] was approved. === 2024-06-16 === * 01:18 [[gitlab:roysmith|@roysmith]] was approved. === 2024-06-08 === * 02:45 [[gitlab:jleedev|@jleedev]] was approved. === 2024-06-03 === * 13:57 [[gitlab:afeder|@afeder]] was approved. === 2024-06-01 === * 10:54 [[gitlab:florianschmitt|@florianschmitt]] was approved. === 2024-05-30 === * 16:42 [[gitlab:krlsca|@krlsca]] was approved. === 2024-05-28 === * 11:24 [[gitlab:rickijay|@rickijay]] was approved. === 2024-05-26 === * 11:18 [[gitlab:ranjithsiji|@ranjithsiji]] was approved. === 2024-05-25 === * 07:24 [[gitlab:jony|@jony]] was approved. === 2024-05-23 === * 08:45 [[gitlab:lepticed7|@lepticed7]] was approved. === 2024-05-22 === * 20:42 [[gitlab:echecs|@echecs]] was approved. === 2024-05-21 === * 13:33 [[gitlab:mbs|@mbs]] was approved. 
=== 2024-05-19 === * 18:06 [[gitlab:ionenlaser|@ionenlaser]] was approved. === 2024-05-18 === * 23:36 [[gitlab:mdaniels5757|@mdaniels5757]] was approved. === 2024-05-17 === * 08:54 [[gitlab:grapedog|@grapedog]] was approved. === 2024-05-08 === * 19:42 [[gitlab:kelhurd|@kelhurd]] was approved. * 19:06 [[gitlab:khurd|@khurd]] was approved. === 2024-05-06 === * 19:48 [[gitlab:j3j5|@j3j5]] was approved. * 12:06 [[gitlab:tk-999|@tk-999]] was approved. === 2024-05-05 === * 22:09 [[gitlab:pppery|@pppery]] was approved. * 20:33 [[gitlab:sakretsu|@sakretsu]] was approved. * 12:12 [[gitlab:waterquark|@waterquark]] was approved. === 2024-05-04 === * 09:03 [[gitlab:multichill|@multichill]] was approved. * 07:42 [[gitlab:abaris|@abaris]] was approved. === 2024-05-03 === * 14:57 [[gitlab:maurusian|@maurusian]] was approved. === 2024-04-24 === * 05:48 [[gitlab:wolfinux|@wolfinux]] was approved. === 2024-04-23 === * 15:48 [[gitlab:dreamrimmer|@dreamrimmer]] was approved. === 2024-04-21 === * 06:51 [[gitlab:alon|@alon]] was approved. === 2024-04-17 === * 23:33 [[gitlab:derenrich|@derenrich]] was approved. === 2024-04-16 === * 17:18 [[gitlab:valcio|@valcio]] was approved. === 2024-04-14 === * 16:51 [[gitlab:wikilucas00|@wikilucas00]] was approved. === 2024-04-06 === * 12:48 [[gitlab:theprotonade|@theprotonade]] was approved. === 2024-04-02 === * 07:30 [[gitlab:bohuizhang|@bohuizhang]] was approved. === 2024-03-30 === * 13:36 [[gitlab:lpintscher|@lpintscher]] was approved. === 2024-03-26 === * 17:09 [[gitlab:eenabulele|@eenabulele]] was approved. === 2024-03-25 === * 14:27 [[gitlab:tuukka|@tuukka]] was approved. === 2024-03-24 === * 12:24 [[gitlab:firefly|@firefly]] was approved. === 2024-03-21 === * 19:33 [[gitlab:universal-omega|@universal-omega]] was approved. === 2024-03-17 === * 10:36 [[gitlab:bisel91|@bisel91]] was approved. === 2024-03-16 === * 10:09 [[gitlab:delord|@delord]] was approved. * 00:42 [[gitlab:athulvis1|@athulvis1]] was approved. 
=== 2024-03-15 === * 19:06 [[gitlab:ignaciorodrguez|@ignaciorodrguez]] was approved. * 08:30 [[gitlab:peachey88|@peachey88]] was approved. * 06:51 [[gitlab:derick|@derick]] was approved. === 2024-03-12 === * 15:06 [[gitlab:xiaoxiao|@xiaoxiao]] was approved. === 2024-03-06 === * 13:21 [[gitlab:desianabae1|@desianabae1]] was approved. === 2024-03-05 === * 19:21 [[gitlab:ep1c|@ep1c]] was approved. * 16:33 [[gitlab:jasmine|@jasmine]] was approved. === 2024-03-02 === * 06:42 [[gitlab:potsdamlamb|@potsdamlamb]] was approved. === 2024-02-29 === * 23:18 [[gitlab:arandomname123|@arandomname123]] was approved. * 18:03 [[gitlab:baba|@baba]] was approved. * 17:48 [[gitlab:yfdyh000|@yfdyh000]] was approved. * 03:09 [[gitlab:sds|@sds]] was approved. === 2024-02-27 === * 23:33 [[gitlab:lofhi|@lofhi]] was approved. === 2024-02-15 === * 19:45 [[gitlab:gergesshamon|@gergesshamon]] was approved. === 2024-02-14 === * 14:33 [[gitlab:philipnelson99|@philipnelson99]] was approved. === 2024-02-13 === * 13:06 [[gitlab:dringsim|@dringsim]] was approved. === 2024-02-12 === * 17:36 [[gitlab:haak|@haak]] was approved. === 2024-02-05 === * 17:33 [[gitlab:qwerfjkl|@qwerfjkl]] was approved. * 17:14 [[gitlab:ahecht|@ahecht]] was approved. === 2024-02-01 === * 09:27 [[gitlab:arinaigum|@arinaigum]] was approved. * 00:15 [[gitlab:jas42|@jas42]] was approved. * 00:15 [[gitlab:edhu|@edhu]] was approved. * 00:15 [[gitlab:marnanel|@marnanel]] was approved. * 00:15 [[gitlab:ibrahemqasim|@ibrahemqasim]] was approved. * 00:15 [[gitlab:amasotti|@amasotti]] was approved. * 00:15 [[gitlab:deni|@deni]] was approved. * 00:15 [[gitlab:cyber|@cyber]] was approved. * 00:15 [[gitlab:saroj|@saroj]] was approved. === 2024-01-29 === * 21:42 [[gitlab:rgupta|@rgupta]] was approved. === 2024-01-07 === * 09:48 [[gitlab:lutrome|@lutrome]] was approved. === 2024-01-05 === * 20:48 [[gitlab:jinoytommanjaly|@jinoytommanjaly]] was approved. * 02:51 [[gitlab:braunobruno|@braunobruno]] was approved. 
* 01:08 [[gitlab:amorymeltzer|@amorymeltzer]] was approved. * 01:08 [[gitlab:phi22ipus|@phi22ipus]] was approved. === 2024-01-03 === * 14:45 [[gitlab:gabina|@gabina]] was approved. === 2024-01-02 === * 13:18 [[gitlab:arthurtaylor|@arthurtaylor]] was approved. === 2023-12-23 === * 00:33 [[gitlab:aram|@aram]] was approved. === 2023-12-22 === * 16:24 [[gitlab:elpitareio|@elpitareio]] was approved. === 2023-12-21 === * 00:43 [[gitlab:bsadowski1|@bsadowski1]] was approved. * 00:43 [[gitlab:ederporto|@ederporto]] was approved. * 00:43 [[gitlab:sadraiiali|@sadraiiali]] was approved. * 00:43 [[gitlab:wasp-outis|@wasp-outis]] was approved. * 00:43 [[gitlab:bodhisattwa|@bodhisattwa]] was approved. * 00:43 [[gitlab:air7538|@air7538]] was approved. * 00:43 [[gitlab:anzx|@anzx]] was approved. * 00:43 [[gitlab:tekask1903|@tekask1903]] was approved. * 00:42 [[gitlab:kiwi-0x010c|@kiwi-0x010c]] was approved. * 00:42 [[gitlab:mpaa|@mpaa]] was approved. * 00:42 [[gitlab:kutay|@kutay]] was approved. * 00:42 [[gitlab:wattmto|@wattmto]] was approved. 
97zfdtw3d1dac274voolhk2ctbg2zwu Nova Resource:Tools.wdactle/SAL 498 458722 2309627 2306835 2025-06-08T14:42:34Z Stashbot 7414 wmbot~lucaswerkmeister@tools-bastion-13: deployed 7c1eaba398 (split guesses into words) 2309627 wikitext text/x-wiki === 2025-06-08 === * 14:42 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|7c1eaba398}} (split guesses into words) === 2025-05-29 === * 15:27 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|894d8b3ef5}} (three minor improvements) * 14:36 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|3ad25bd9e0}} (support globe coordinates – all value types now supported 🎉) * 14:10 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|2db51d07e9}} (support for most time values) * 13:15 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|1b4ccaff9c}} (focus guess input on key press) * 12:55 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|b270cde644}} (very rudimentary lexeme support) === 2025-05-27 === * 19:46 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|0fa08af28b}} (“I give up” feature) === 2025-05-26 === * 18:20 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|8f20eca8ac}} (scroll to guess on click) === 2025-05-25 === * 12:25 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|8e2d706ed2}} (mobile design again, just put the guessing area at the top) === 2025-05-23 === * 16:47 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|df1752303e}} (use action=wbformatentities with generate=text/plain, cc [[phab:T393691|T393691]]) === 2025-05-12 === * 18:54 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|9f541762ae}} (another mobile design attempt, this time with position: sticky) * 18:39 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|93a2898396}} (fix EntitySchema crash) === 2025-05-09 === * 20:33 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|97b7de4ece}} (improved victory message) * 19:44 
wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|b8820a19ce}} (eslint improvements) * 18:01 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|c26a336aa8}} (better viewport units for mobile, hopefully) === 2025-05-08 === * 15:59 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|b6a2243821}} (better mobile layout) * 15:54 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|7ef1b88fa0}} (logical viewport lengths) [an hour ago, the initial dologmsg failed and I didn’t notice] * 14:43 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|054a3064ca}} (minor code fixes, setting up ESLint and GitLab CI; should have no functional changes) === 2025-05-07 === * 00:04 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|71d995277d}} (show errors, improve unsupported data type styles) === 2025-05-06 === * 20:34 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|1674729ca9}} (use MediaWiki language info, fix build message) === 2025-05-04 === * 13:00 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|76cab41655}} (progress bar for loading, layout tweaks) * 11:04 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|b86d170f9f}} (better spacing) * 10:55 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|f1a15413cd}} (CdxDialog on win) * 09:02 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|df2f600b86}} (prevent empty guesses) * 08:54 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|30ffddd793}} (?uselang= URL parameter, better user agent) * 07:49 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|f6e45ecf90}} (?entityId= and ?query= URL parameters, each with a few aliases too) === 2025-05-03 === * 23:17 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|fc44627b5b}} (improve word splitting, reduce contrast) * 21:30 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|d4b39b68a6}} (initial deployment #wmhack) <noinclude>[[Category:SAL]]</noinclude> 
lh54r5q27mayftxfcwqhtg5jyhznzkh 2309630 2309627 2025-06-08T15:40:55Z Stashbot 7414 wmbot~lucaswerkmeister@tools-bastion-13: deployed 168c371259 (refactoring: labels store, should have no effect) 2309630 wikitext text/x-wiki === 2025-06-08 === * 15:40 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|168c371259}} (refactoring: labels store, should have no effect) * 14:42 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|7c1eaba398}} (split guesses into words) === 2025-05-29 === * 15:27 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|894d8b3ef5}} (three minor improvements) * 14:36 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|3ad25bd9e0}} (support globe coordinates – all value types now supported 🎉) * 14:10 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|2db51d07e9}} (support for most time values) * 13:15 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|1b4ccaff9c}} (focus guess input on key press) * 12:55 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|b270cde644}} (very rudimentary lexeme support) === 2025-05-27 === * 19:46 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|0fa08af28b}} (“I give up” feature) === 2025-05-26 === * 18:20 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|8f20eca8ac}} (scroll to guess on click) === 2025-05-25 === * 12:25 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|8e2d706ed2}} (mobile design again, just put the guessing area at the top) === 2025-05-23 === * 16:47 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|df1752303e}} (use action=wbformatentities with generate=text/plain, cc [[phab:T393691|T393691]]) === 2025-05-12 === * 18:54 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|9f541762ae}} (another mobile design attempt, this time with position: sticky) * 18:39 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|93a2898396}} (fix EntitySchema crash) === 2025-05-09 === * 20:33 
wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|97b7de4ece}} (improved victory message) * 19:44 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|b8820a19ce}} (eslint improvements) * 18:01 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|c26a336aa8}} (better viewport units for mobile, hopefully) === 2025-05-08 === * 15:59 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|b6a2243821}} (better mobile layout) * 15:54 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|7ef1b88fa0}} (logical viewport lengths) [an hour ago, the initial dologmsg failed and I didn’t notice] * 14:43 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|054a3064ca}} (minor code fixes, setting up ESLint and GitLab CI; should have no functional changes) === 2025-05-07 === * 00:04 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|71d995277d}} (show errors, improve unsupported data type styles) === 2025-05-06 === * 20:34 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|1674729ca9}} (use MediaWiki language info, fix build message) === 2025-05-04 === * 13:00 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|76cab41655}} (progress bar for loading, layout tweaks) * 11:04 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|b86d170f9f}} (better spacing) * 10:55 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|f1a15413cd}} (CdxDialog on win) * 09:02 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|df2f600b86}} (prevent empty guesses) * 08:54 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|30ffddd793}} (?uselang= URL parameter, better user agent) * 07:49 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|f6e45ecf90}} (?entityId= and ?query= URL parameters, each with a few aliases too) === 2025-05-03 === * 23:17 wmbot~lucaswerkmeister@tools-bastion-13: deployed {{Gerrit|fc44627b5b}} (improve word splitting, reduce contrast) * 21:30 wmbot~lucaswerkmeister@tools-bastion-13: deployed 
{{Gerrit|d4b39b68a6}} (initial deployment #wmhack) <noinclude>[[Category:SAL]]</noinclude> 3w49tn4joafs4hnmq53y7ayq28bo35q User talk:ReggyCelly 3 458831 2309668 2025-06-09T09:05:35Z StrikerBot 8475 Welcome to Toolforge! 2309668 wikitext text/x-wiki == Welcome to Toolforge! == Hello ReggyCelly, welcome to the Toolforge project! Your request for access was processed, and you should be able to use ssh to connect to <tt>login.toolforge.org</tt>. You will need to log out and log in again at https://toolsadmin.wikimedia.org/ to activate your new permissions there. Check the [[Help:Toolforge|Toolforge help page]] for tips on using your account. You can also ask questions in our IRC channel at {{irc|wikimedia-cloud}} or send an e-mail to our mailing list <tt>cloud@lists.wikimedia.org</tt>. Thank you, and have fun making Tools! --[[User:StrikerBot|StrikerBot]] ([[User talk:StrikerBot|talk]]) 09:05, 9 June 2025 (UTC) cq8ervpcnsvnmlvl7wrf6mqc3fcegw6 User talk:Sumitsurai 3 458832 2309674 2025-06-09T09:26:10Z StrikerBot 8475 Welcome to Toolforge! 2309674 wikitext text/x-wiki == Welcome to Toolforge! == Hello Sumitsurai, welcome to the Toolforge project! Your request for access was processed, and you should be able to use ssh to connect to <tt>login.toolforge.org</tt>. You will need to log out and log in again at https://toolsadmin.wikimedia.org/ to activate your new permissions there. Check the [[Help:Toolforge|Toolforge help page]] for tips on using your account. You can also ask questions in our IRC channel at {{irc|wikimedia-cloud}} or send an e-mail to our mailing list <tt>cloud@lists.wikimedia.org</tt>. Thank you, and have fun making Tools! --[[User:StrikerBot|StrikerBot]] ([[User talk:StrikerBot|talk]]) 09:26, 9 June 2025 (UTC) eh0fhyco3z7gwegzw2209w3lp4kgtn3 User talk:Mmta 3 458833 2309678 2025-06-09T09:33:01Z StrikerBot 8475 Welcome to Toolforge! 2309678 wikitext text/x-wiki == Welcome to Toolforge! == Hello Mmta, welcome to the Toolforge project! 
Your request for access was processed, and you should be able to use ssh to connect to <tt>login.toolforge.org</tt>. You will need to log out and log in again at https://toolsadmin.wikimedia.org/ to activate your new permissions there. Check the [[Help:Toolforge|Toolforge help page]] for tips on using your account. You can also ask questions in our IRC channel at {{irc|wikimedia-cloud}} or send an e-mail to our mailing list <tt>cloud@lists.wikimedia.org</tt>. Thank you, and have fun making Tools! --[[User:StrikerBot|StrikerBot]] ([[User talk:StrikerBot|talk]]) 09:33, 9 June 2025 (UTC) 6gm1eq43jejkhobd9dmuyfecv6gbru0 User talk:Rizkynat404 3 458834 2309685 2025-06-09T09:44:49Z StrikerBot 8475 Welcome to Toolforge! 2309685 wikitext text/x-wiki == Welcome to Toolforge! == Hello Rizkynat404, welcome to the Toolforge project! Your request for access was processed, and you should be able to use ssh to connect to <tt>login.toolforge.org</tt>. You will need to log out and log in again at https://toolsadmin.wikimedia.org/ to activate your new permissions there. Check the [[Help:Toolforge|Toolforge help page]] for tips on using your account. You can also ask questions in our IRC channel at {{irc|wikimedia-cloud}} or send an e-mail to our mailing list <tt>cloud@lists.wikimedia.org</tt>. Thank you, and have fun making Tools! --[[User:StrikerBot|StrikerBot]] ([[User talk:StrikerBot|talk]]) 09:44, 9 June 2025 (UTC) nenhs366n8oyhx6yyyqvxdj9dlkijlb